ABSTRACT


Matching  is  defined  as  the  methodology  of  merging  micro-data  files  to 
create  larger  files  of  data.  Matching  is  often  done  to  extract  statistical 
information  which  cannot  be  obtained  from  the  individual  files  that  are 
incomplete.  Current  federal  statistical  practice  involving  multivariate 
file-merging  techniques  is  typically  not  based  on  a  formal  statistical 
theory.  In  view  of  this  situation,  a  survey  on  matching  is  given.  All  known 
models  for  matching  are  presented  under  a  unified  framework,  which  consists 
of three situations involving the same or similar individuals.

The  properties  of  a  maximum  likelihood  strategy  to  match  files  of  data 
involving  the  same  individuals  are  derived  via  ranks  and  order-statistics 
from  bivariate  populations.  In  addition,  the  properties  of  this  strategy 
have  been  examined  with  respect  to  a  more  reasonable  criterion  called 
epsilon-correct matching. Asymptotic results for such situations, including
(i) the Poisson approximation for the distribution of the number of correct
matches, and (ii) convergence in probability of the average number of
epsilon-correct matches, have been derived. Small-sample properties, like the
monotone  behavior  of  the  expected  number  of  matches  with  respect  to  the 
dependence  of  parameters  of  the  underlying  models,  have  been  proved. 

Two matching strategies due to Kadane (1978) and one strategy due to
Sims  (1978)  for  merging  files  of  data  on  similar  individuals  are  discussed. 
These  strategies  are  evaluated  via  a  Monte-Carlo  study  of  matching  models 
involving  trivariate  normal  distributions. 

1. INTRODUCTION


One of the most important tools for analyzing economic policies
is  the  micro-analytic  model.  This  technique  is  used  frequently  in 
public  decision-making  centers.  Virtually  every  Federal  Agency  uses 
micro-analytic  models  for  the  evaluation  of  policy  proposals. 

Direct  use  of  sample  observations  rather  than  aggregated  data 
is  characteristic  of  the  micro-analytic  approach.  For  this  reason, 
the  micro-data  that  is  used  as  input  to  the  model  has  a  significant 
bearing  on  the  validity  of  the  results  of  the  model.  Furthermore, 
when  all  the  input-data  come  from  a  single  sample,  the  quality  of  the 
model depends on, among other things, sampling and data-recording
procedures. However, if the data from a single source is insufficient or
partly  aggregated,  then  typically  multiple  sources  of  data  are  used 
to  provide  the  necessary  input  to  the  model.  At  the  same  time, 
issues  such  as  validity  and  quality  of  the  results  of  the  model 
cannot  be  assessed  as  easily  as  when  we  have  a  single  source  of  data 
as  input.  In  such  situations,  government  statisticians  have  been 
using  a  methodology  in  which  multiple  sources  of  data  are  merged  to 
form a composite data-file. Effective use of the different pieces of
data  in  order  to  produce  sensible  but  more  comprehensive  files  is  a 
fundamental  issue  in  the  file-merging  methodology. 


Some of the difficulties associated with the merging procedures
and techniques for their resolution have been known for quite some
time. Initiated by the Federal Subcommittee on Matching Techniques,
there has recently been a renewed effort to establish a solid theoretical
foundation and empirical justification for the file-merging
methodology. This research reviews the relevant literature and then
presents new statistical properties of some known procedures for merging
data-files. We shall now give an example of a typical situation in
which merging of two files is carried out.

1.1 A Paradigm

A micro-economic model in heavy use at the Office of Tax
Analysis (OTA), Department of the Treasury, is the Federal Personal
Income Tax Model. This model is used to assess proposed tax law
changes in terms of their effects on the distribution of after-tax
income, the efficiency with which the changes will operate in
achieving their objectives, etc. The inputs for this model are two
sources of micro-data, namely the Statistics of Income File (SOI)
and the Current Population Survey (CPS). The SOI file is generated
annually by the Internal Revenue Service (IRS) and it consists of
personal tax return data. The CPS file is produced monthly by the
Bureau of the Census. As we will explain in Section 1.2, such
pooling of data from more than one Federal Agency has been severely
restricted in recent years by, among other things, confidentiality issues
such as the privacy of the individuals involved in the aforementioned
files of data. For this reason, complete information, especially
identifiers such as social security numbers, is typically not
released by the IRS and the Census Bureau. The resulting micro-data
files are compromises between complete Census files and fully
aggregated data-sets. Thus, sufficient detail remains to support
micro-analysis of the population, while partial aggregation protects
individual privacy and greatly diminishes computational burden.

A typical problem in tax-policy evaluation occurs when no single
available data file such as SOI or CPS contains all the information
needed for an analysis. For example, consider the variables
W = (X, Y, Z1, Z2), where

X = allowable itemizations and capital gains

Y = Old Age Survivors Disability Insurance (OASDI) benefits

Z1 = social security number

Z2 = marital status

Suppose that we are interested in estimating a simple correlation
$\gamma$ between X and Y or, more generally, the expectation of a known
function g, say, of W; that is, the integral

$\gamma = \int g(w) \, dF(w)$  (1.1.1)

where F(w) is the joint distribution function of the variables in W.
Now, the SOI micro-data file cannot be used in its original form since
it does not include the OASDI benefits (Y). Census files (CPS) with
OASDI benefits do not allow a complete analysis of the effect of
including this benefit, since they do not contain information on
allowable itemizations and capital gains (X). Thus, instead of
observing X, Y, Z1, Z2 jointly on the same units, we have only
the following pair of files:

File  1  (SOI):  X,Z1,Z2 

and 

File  2  (CPS):  Y, 

Estimating $\gamma$ based on the fragmentary data provided by File 1 and
File 2 is an important practical problem that has not yet been solved
satisfactorily. In an attempt to cope with situations such as the
OTA model, Federal Agencies have long been using procedures for
matching or merging the two incomplete files so that one can do the
usual inference for $\gamma$, hoping that the merged file is a reasonable
substitute for the unobserved data on (X, Y, Z1, Z2).

The reporting units in CPS are households. In general, the
units in a file may refer to other types of legal persons, like
corporations, partnerships and fiduciaries. The term "individual"
will be used as a generic label in this thesis to refer to the
reporting units of the micro-data files.

1.2 A Dichotomy of Matching Problems

Roughly speaking, there are two different categories of matching
problem. The first category consists of problems of exact matching
in which it is desired to identify pairs of records in the two files
that pertain to the same individual. Accurate information on identifiers
such as social security number, name, and address is assumed to be
available when exact-matching the two files. It is clear that all we
need to carry out an exact match of two files is, among other tools,
efficient software to sort the individuals by their identifiers.
With the help of such software, we can, within reasonable error, link
a given individual in File 1 with an individual in File 2 such that
these two units possess the same values for the identifiers. The
resulting merged file contains data which are more comprehensive than
both File 1 and File 2. Also, even after merging, most records will
pertain to the same individual, the number of erroneous matches in
the enlarged file depending on the particular software used in the
process of merging. It is clear that, if accurate identifiers are
available for the units in the two files, then no statistical issues
are involved in the matching methodology and we shall not discuss
this type of problem any further. However, one may refer to, among
others, Fellegi and Sunter (1969) and Radner et al. (1980) for work
related to the exact matching methodology. We shall close our
discussion of this type of matching problem by noting some of the
reasons why exact matching of files is often not possible.

First, over the past several years, there have been significant
changes in the laws and regulations pertinent to exact matching of
records for statistical and research purposes. New laws, especially
the Privacy Act of 1974 and the Tax Reform Act of 1976, have imposed
additional restrictions on the matching of records belonging to more
than one Federal Agency and on the matching of files of Federal
Agencies with those of other organizations. As a result of these
laws, some Agencies have limited access to their records for
statistical purposes to an even greater extent than the statutory
requirements seem to demand.

Second, analyses of micro-data often involve data from units that
are not available from a single source but are available from several
sources. For example, suppose that one is interested in the relationships
among two sets of variables, one set consisting of information
about health care expenses incurred by individuals and the other set
consisting of information about receipt of various types of welfare
benefits. Suppose further that no existing data file contains all of
the needed variables, but that two samples of a target population,
which come from two different surveys, together contain all these
variables. If executing a new survey to obtain all the variables
from a single sample is not feasible, then one might match the two
samples and use the merged file for statistical analyses of variables
which are not present in the same sample. Note that the two sample
surveys may have information on the same individuals whose identities
are either unknown or unreliable. However, in the aforementioned
example, it is more appropriate to assume that the two
samples contain very few or no individuals in common. In case the
two samples are stochastically independent, we shall describe the
units in the two samples as similar individuals.

Suppose, then, that exact matching is not feasible in view of
the aforementioned reasons. Then the tools that are used in the
exact matching methodology are inadequate for the purpose of merging
the two files of data. In particular, identifiers are practically
useless. However, the probabilistic structure of the populations
that generate the data in the two files, or other statistical
techniques, can often be used to combine the two files. Such
procedures will be called statistical matching strategies.

In the literature on matching files there is no consensus on
rigid definitions of Exact Match and Statistical Match. Indeed, it
is traditional to distinguish these two types of problem by verifying
whether the same (exact) or similar (statistical) individuals are in
the two files. Our classification of matching problems is somewhat
different from the usual practice in the sense that any procedure
for merging files, which may contain the same or similar individuals,
will be described as a statistical match if statistical techniques
are involved in the process of merging. This convention is in agreement
with that of Woodbury (1983), who describes certain matching
problems involving the same individuals in two files as "Statistical
Record Matching for Files".

1.3 A General Set-up for Statistical Matching
Consider a universe $\mathcal{U}$ of individuals. Let X, Y, Z denote three
groups of random variables and let us assume that we cannot observe
the vector W = (X, Y, Z) for any unit in $\mathcal{U}$. However, suppose that the
following data are available:

(Base) File 1: $n_1$ individuals, each with information on a
function $W_1^*$, say, of W,

and (Supplementary) File 2: $n_2$ individuals, each with information
on a function $W_2^*$, say, of W.

Various matching problems arise depending on what type of data are in
$W_1^*$ and $W_2^*$. We distinguish only three different situations:

Case I: $W_1^* = X$ and $W_2^* = Y$; we also assume that the two files
contain the same individuals.

Case II: Let $W_1^* = (X, Z)$, $W_2^* = (Y, Z)$. As in Case I, we further
assume that the two files contain the same individuals.

Case III: Let $W_1^* = (X, Z)$, $W_2^* = (Y, Z)$. Unlike in Cases I and II, we
assume that the two files contain similar individuals.

1.4 The Matching Methodology - Some Important Steps

We shall now mention some steps involved in actually creating a
statistical match between two given files. First, if the populations
represented by the files differ, a "universe adjustment" is carried
out to ensure that there is a common universe $\mathcal{U}$ from which the
individuals of the two files are sampled. Second, a "units adjustment"
might be needed if the units of observation in the two files differ
(e.g. persons and tax units). Third, "matching or common variables,"
Z, are defined and it is assumed that File 1 with $n_1$ records carries
information on (X, Z), whereas File 2 with $n_2$ records consists of data
on (Y, Z). The variables X and Y are often called non-matching
variables. Finally, in the "merging" step, if the records $(X_i, Z_i)$
and $(Y_j, Z_j)$, respectively from File 1 and File 2, are to be matched,
then one completes the ith record in File 1 by substituting $Y_j$ for
the missing value. Thus, we get the synthetic File 1:

$(X_i, Y_i^*, Z_i)$, i = 1, 2, ..., $n_1$,

where $Y_i^*$ is the Y value assigned to the ith record.

Clearly, the same methodology can be used to get a synthetic File 2
by finding substitutes for missing X values of File 2 using X's from
File 1. However, in order to keep our discussion simple, we shall
often be concerned with completing only File 1. Although many
different methods have been used in this final step, several basic
similarities can be identified. In most matches, certain Z variables
are treated as the so-called "cohort" variables. Such variables
establish "packets" of the records in each of the two files, with
matching permitted only between pairs of cases in the same packet.
For example, sex is often a cohort variable so that a male can be
matched only with another male, and a female only with another female. This
step about the formation of cells or packets is aimed at diffusing
the dissimilarities between units that are being matched. Furthermore,
depending on how many of the common variables are used as
cohort variables, there may be very little or no within-packet
variation with regard to Z. In such situations, File 1 has data on
X and File 2 has data on Y and we would like to merge the files to
get joint information on X and Y. Note that, in Section 1.3, such a
scenario was labeled Case I. The selection of "matching records"
within a packet is typically based on a "measure of dissimilarity" by
which a "distance" is computed between a given File 1 record and each
potential match in the supplementary file. A potential match with
the smallest distance is chosen as the match that will provide the
missing Y value to a File 1 record.
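
The packet-and-distance procedure just described can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the procedure of any agency: the record layout (dictionaries with keys `sex`, `age`, `x`, `y`), the function name `match_within_packets`, the toy values, and the use of absolute age difference as the distance are all hypothetical.

```python
from collections import defaultdict

def match_within_packets(file1, file2, cohort_key, distance):
    """For each File 1 record, pick the File 2 record in the same packet
    (same value of the cohort variable) with the smallest distance.
    Returns a list of (file1_index, file2_index or None)."""
    # Build packets of File 2 records, keyed by the cohort variable.
    packets = defaultdict(list)
    for j, rec in enumerate(file2):
        packets[rec[cohort_key]].append(j)

    matches = []
    for i, rec in enumerate(file1):
        candidates = packets.get(rec[cohort_key], [])
        if not candidates:
            matches.append((i, None))  # empty packet: no match possible
            continue
        best = min(candidates, key=lambda j: distance(rec, file2[j]))
        matches.append((i, best))
    return matches

# Hypothetical toy data: sex is the cohort variable, age the matching variable.
file1 = [
    {"sex": "M", "age": 34, "x": 1200.0},
    {"sex": "F", "age": 51, "x": 300.0},
    {"sex": "F", "age": 29, "x": 0.0},
]
file2 = [
    {"sex": "F", "age": 30, "y": 0.0},
    {"sex": "M", "age": 60, "y": 7400.0},
    {"sex": "F", "age": 55, "y": 5100.0},
    {"sex": "M", "age": 36, "y": 0.0},
]

def age_distance(r1, r2):
    """Assumed measure of dissimilarity: absolute difference in age."""
    return abs(r1["age"] - r2["age"])

matches = match_within_packets(file1, file2, "sex", age_distance)

# Synthetic File 1: each record completed with the Y value of its match.
synthetic_file1 = [
    dict(file1[i], y=file2[j]["y"]) for i, j in matches if j is not None
]
```

Nothing in this sketch prevents the same File 2 record from being used for several File 1 records; whether that is acceptable depends on the restrictions imposed on the match.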


1.5 Two Basic Types of Matching Strategies
Suppose that the age of an individual, Z, say, is a matching
variable. Then, one may define a distance measure d, say, between
individuals i in File 1 and j in File 2 by the equation

$d_{ij} = |z_{1i} - z_{2j}|$  (1.5.1)

where $z_{1i}$ and $z_{2j}$ denote the ages recorded for the ith individual
in File 1 and the jth individual in File 2. For fixed i = 1, 2, ..., $n_1$,
one will then match a record j* in File 2 with the ith record in
File 1 if j* minimizes $d_{ij}$ over j. That is, j* depends possibly on i
and satisfies the restriction

$d_{ij^*} = \min_{1 \le j \le n_2} d_{ij}$  (1.5.2)

If the choice of j* involves no other restrictions, then the
statistical matching strategy is called "Unconstrained Matching". However,
there are typically additional restrictions subject to which one must
choose the optimal match j* from File 2. Matching data files with
the restriction that the variance-covariance matrix of data items in
each file be identical to the variance-covariance matrix of the same
data items in the matched file is an example of a "Constrained Match."

In order to formulate this type of merging mathematically,
assume first, for simplicity, that both files carry only n records;
that is, the common value of $n_1$ and $n_2$ is n. Let

$a_{ij} = 1$ if the ith record in File 1 is matched with the jth record in File 2,
$a_{ij} = 0$ if the ith record in File 1 is not matched with the jth record in File 2,
for $1 \le i, j \le n$.  (1.5.3)

Then, the following additional conditions will ensure that the
aforementioned preservation of moments is achieved by not letting
more than one record in File 1 be matched with the same record in
File 2:

$\sum_{i=1}^{n} a_{ij} = 1$, for j = 1, 2, ..., n  (1.5.4)

$\sum_{j=1}^{n} a_{ij} = 1$, for i = 1, 2, ..., n  (1.5.5)

Now let $d_{ij}$ denote, as in the case of an unconstrained match, a
measure of inter-record dissimilarity given by the extent to which
the attributes in any one record differ from the same attributes in
another record. Then the optimal constrained match minimizes the
"objective function"

$\sum_{i=1}^{n} \sum_{j=1}^{n} d_{ij} a_{ij}$  (1.5.6)

subject to the restrictions in (1.5.3) to (1.5.5). Clearly, this
extremal problem is the standard linear assignment problem in
"Optimization."

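To make the contrast concrete, here is a small sketch comparing the unconstrained rule (1.5.2) with the constrained rule that minimizes (1.5.6) subject to (1.5.3) to (1.5.5). The ages are made up and the function names are hypothetical. The constrained optimum is found by brute-force enumeration of all n! permutations, which is workable only for very small n; in practice an algorithm for the linear assignment problem would be used instead.

```python
from itertools import permutations

def unconstrained_match(d):
    """For each row i, choose the column j minimizing d[i][j]
    (ties broken by the smallest j). Columns may be reused."""
    return [min(range(len(row)), key=lambda j: row[j]) for row in d]

def constrained_match(d):
    """Minimize sum_i d[i][p[i]] over all permutations p, so that every
    File 2 record is used exactly once. Brute force: O(n!)."""
    n = len(d)
    return list(min(permutations(range(n)),
                    key=lambda p: sum(d[i][p[i]] for i in range(n))))

def total_distance(d, match):
    """Value of the objective function (1.5.6) for a given match."""
    return sum(d[i][j] for i, j in enumerate(match))

# Hypothetical ages (the matching variable) in the two files.
ages_file1 = [30, 31, 50]
ages_file2 = [29, 45, 52]

# Distance matrix d[i][j] = |z_1i - z_2j|, as in (1.5.1).
d = [[abs(a - b) for b in ages_file2] for a in ages_file1]

free = unconstrained_match(d)   # File 2 records may repeat
tied = constrained_match(d)     # one-to-one assignment
```

In this toy example the unconstrained match uses the 29-year-old record of File 2 twice and never uses the 45-year-old record, whereas the constrained match uses every File 2 record exactly once at the price of a larger total distance (17 against 5).
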
A matching situation more typical of problems relating to policy
analyses is a constrained merge of two files with variable weights
in both files and an unequal number of records in the files. Let
$\alpha_i$ be the weight of the ith record in File 1, and let $\beta_j$ be the weight
of the jth record in File 2. If $n_1$, $n_2$ are, respectively, the numbers
of records in File 1 and File 2, then we minimize the objective function
in (1.5.6), with the sums now running over i = 1, ..., $n_1$ and
j = 1, ..., $n_2$, subject to the following constraints:

$\sum_{j=1}^{n_2} a_{ij} = \alpha_i$, i = 1, 2, ..., $n_1$  (1.5.7)

$\sum_{i=1}^{n_1} a_{ij} = \beta_j$, j = 1, 2, ..., $n_2$  (1.5.8)

$\sum_{i=1}^{n_1} \alpha_i = \sum_{j=1}^{n_2} \beta_j$  (1.5.9)

and

$a_{ij} \ge 0$, for all i and j  (1.5.10)

It is clear that an optimal constrained matching strategy when
the two files have unequal numbers of individuals is the solution of
a standard transportation problem in which the roles of the
"warehouses" and "markets" are respectively played by the records in File
1 and File 2 and the "cost of transportation" is the inter-record
distance. Existing algorithms to solve a linear assignment or
transportation problem can be used to complete the final "merge"
step, giving us the synthetic sample

$W_i^* = (X_i, Y_i^*, Z_i)$, $1 \le i \le n_1$,  (1.5.11)


where $Y_i^*$ denotes the value of Y assigned to the ith record of File 1.
The sample in (1.5.11) may now be used to estimate a parameter like
$\gamma$ in (1.1.1).


1.6 Criticisms of Statistical Matching


In Sections 1.4 and 1.5, we described the general form of most
matching techniques that have been used by Federal Agencies.
Matching records at the "packet" level amounts basically to assuming that the
random vectors X and Y are stochastically independent, given the
value of the common variables Z. In the particular case of a
multivariate normal distribution for W = (X, Y, Z), the conditional independence
assumption is equivalent to the claim that the partial correlations
among X and Y variables, controlling on the Z variables, are all
zero. This point was made first by Sims (1972) and repeatedly by
others since then. The conditional independence assumption is a
strong one for which convincing justification has generally not been
offered. It implies that the relationship between X and Y can be
totally inferred from X's relationship to Z and Y's relationship to Z.
Sims (1978) stated that matching the files under such assumptions is
unnecessary. He also sketched an alternative statistical procedure
that uses the data in the two files to estimate, under conditional
independence, a parameter such as $\gamma$ in (1.1.1). Sims' alternative
will be discussed further in Section 3.2.
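
For a trivariate normal (X, Y, Z) with scalar components, the content of the conditional independence assumption can be checked numerically. The sketch below uses the standard formula for the first-order partial correlation; the function names and the numerical correlations are illustrative assumptions only. It shows that a zero partial correlation forces the correlation between X and Y to equal the product of their correlations with Z, which is why the X-Y relationship is then fully determined by what the two files contain separately.

```python
import math

def partial_corr_xy_given_z(r_xy, r_xz, r_yz):
    """First-order partial correlation of X and Y controlling for Z."""
    return (r_xy - r_xz * r_yz) / math.sqrt((1 - r_xz**2) * (1 - r_yz**2))

def implied_r_xy(r_xz, r_yz):
    """Correlation of X and Y implied by conditional independence given Z
    in a trivariate normal model."""
    return r_xz * r_yz

r_xz, r_yz = 0.6, 0.5                      # illustrative values only
r_xy_ci = implied_r_xy(r_xz, r_yz)         # correlation under the assumption
check = partial_corr_xy_given_z(r_xy_ci, r_xz, r_yz)   # numerically zero

# If the true correlation were 0.7 instead, the partial correlation
# would be clearly non-zero and the assumption would fail.
violated = partial_corr_xy_given_z(0.7, r_xz, r_yz)
```

Under the assumption, a synthetic file can at best reproduce `r_xy_ci`; it carries no information about how far the true correlation departs from that value.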


Fellegi (1978) and many other investigators have expressed great
caution about the use of statistical matching because not much is
known about the accuracy of the estimates of the joint distribution
of W produced by synthetic files.

Notwithstanding these criticisms of statistical matching, there
is no viable alternative statistical procedure that will, in general,
provide better estimates of $\gamma$ than a synthetic file can offer.
Given this lack of good alternatives, especially when conditional
independence does not hold, the area of statistical matching is wide
open and both theoretical and empirical investigations to discover
the properties of synthetic data-files are in order.

1.7 Reliability of Synthetic Files
The precision of synthetic-file-based estimators of a given
parameter  relevant  to  the  population  of  W  =  (X,Y,Z)  is  affected  by 
various  types  of  errors  that  occur  while  matching  two  files.  To 
discuss  these  matching  errors,  let  us  first  restrict  our  attention 
to  the  cases  where  the  same  individuals  are  in  the  two  files,  namely 
Case  I  and  Case  II. 

In practice, it is almost inevitable in most matching projects
that some matching errors occur, even with the most sophisticated
procedure and the most careful execution of matching of the files.
These errors fall into two major categories:

(i) Erroneous match (false match) or linking of records that
correspond to different individuals.

(ii) Erroneous non-match (false non-match) or failure to link the
records that do correspond to the same individual.


The reliability of the results of a statistical matching
strategy is often defined (Radner et al., 1980, p. 13) as one of the
following coefficients:

(a) the proportion of the correct matches, that is, matches of
records on the same individuals.

(b) the proportion of erroneous decisions, that is, false matches
and erroneous non-matches.

These reliability coefficients are random variables because, in
view of the terminological conventions of Section 1.2, a statistical
matching strategy is dependent on the data in the two files. The
sampling distributions of the reliability coefficients, either exact
or asymptotic (as the sizes of the files grow), are very useful in
judging the quality of a given matching procedure.
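
When the true pairing is known, as it is in a simulation, the two coefficients can be computed directly. The sketch below assumes Case I or Case II with every File 1 record linked to exactly one File 2 record, so that there are no non-matches and coefficient (b) reduces to the proportion of false matches. The function name and the toy links are hypothetical.

```python
def reliability(assigned, truth):
    """assigned[i] and truth[i] are the File 2 indices linked to File 1
    record i by the matching strategy and in reality, respectively.
    Returns (proportion correct, proportion erroneous)."""
    if len(assigned) != len(truth) or not truth:
        raise ValueError("need two non-empty lists of equal length")
    n = len(truth)
    correct = sum(1 for a, t in zip(assigned, truth) if a == t)
    return correct / n, (n - correct) / n

truth = [2, 0, 3, 1, 4]        # hypothetical true links
assigned = [2, 0, 1, 3, 4]     # links produced by some strategy
prop_correct, prop_error = reliability(assigned, truth)
```

Repeating such a computation over many simulated pairs of files gives an empirical approximation to the sampling distribution of the coefficients.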

Now, we will discuss the reliability of a synthetic file in
Case III, where the two files contain very few or no overlapping
individuals. First, note that the definitions of error in the
results of matching, which have been proposed for Case I, are not
applicable to Case III because the linkage of records from the two
files that pertain to the same unit seldom occurs in Case III. In
other words, almost all linkages in Case III are false matches in the
sense of the definitions given earlier in this section. In Case III,
definitions of error and reliability which are tractable from a
theoretical perspective are unavailable at this time. In fact,
little theoretical work on the errors present in the synthetic files
of Case III has been done. Until now, the evaluation of a given
matching strategy in Case III has been done from an empirical point
of view. A case in point is the work of Rodgers (1984).

1.8 Summary

In Section 1.3, three important cases for merging two files of
data were distinguished. Of these, Case I and Case II are relevant
when the same individuals are represented in the two files. Case III
arises when only similar individuals are present in the files. This
research is concerned with both theoretical investigations and
empirical evaluations of the quality of synthetic files in Case I and
Case III. We shall not discuss Case II in this thesis.

In  Chapter  2,  Case  I  is  discussed  at  some  length.  A  review  of 
known  results  for  this  case  is  given.  New  optimality  properties  of 
a maximum likelihood matching strategy are established. Some small-sample
and large-sample properties of the number of correct matches
with regard to this strategy are derived, shedding some light on the
reliability  of  the  synthetic  file  arising  from  using  the  maximum 
likelihood  strategy. 

Case III is the topic of interest in Chapter 3. The bulk of the
discussion in this chapter is confined to matching two files of data
that are sampled from a trivariate normal population. Thus, if
(X, Y, Z) is a three-dimensional normal random vector, File 1 has data
on (X, Z), while File 2 has data on (Y, Z). Two strategies proposed by
Kadane (1978) and one strategy due to Sims (1978) are used to create
synthetic files out of simulated data on (X, Z) and (Y, Z). These
synthetic files are then evaluated by comparing the estimates of the
correlation between X and Y provided by them with the estimates based
on unbroken data on (X, Y, Z).


2.  MERGING  FILES  OF  DATA  ON  SAME  INDIVIDUALS 


A useful classification of situations involving statistical
matching of data files was discussed in Section 1.3. It may be recalled
that in the context of the two files having the same individuals, this
classification scheme included two cases. Case I is the scenario
where no matching variables Z are present, while Case II is the
situation where matching variables are part of the statistical model.
In this chapter, we shall discuss results relevant to Case I only.


2.1 A General Model


Let $(\mathbf{T}, \mathbf{U})$ be a multi-dimensional random vector with C.D.F. H(t, u)
and P.D.F. h(t, u). Let $(\mathbf{T}_i, \mathbf{U}_i)$, i = 1, 2, ..., n be a random sample of
size n from H. We shall assume that these sample values got broken up
into the component vectors T's and U's before the data could be
recorded. Thus we do not know which T and U values were paired in the
original sample and the two files consist of the following data:

File 1 - $\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_n$,

which is an unknown permutation of $\mathbf{T}_1, \ldots, \mathbf{T}_n$, and

File 2 - $\mathbf{y}_1, \mathbf{y}_2, \ldots, \mathbf{y}_n$,

which is an unknown permutation of $\mathbf{U}_1, \ldots, \mathbf{U}_n$.

DeGroot, Feder and Goel (1971) call this a "Broken Random Sample"
model for two files.

Two  types  of  statistical  decision  and  inference  problems  arise 
from  observing  a  broken  random  sample.  The  first  type  of  problem 
involves trying to pair the x's with the y's in the broken data in
order  to  reproduce  the  pairs  in  the  original  unbroken  sample.  The 
second  type  of  problem  involves  making  inferences  about  the  values  of 
parameters  in  the  joint  distribution  H(t,u)  of  T  and  U. 
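
The first type of problem can be explored by simulation. The sketch below generates a broken random sample from a bivariate normal distribution with correlation rho, re-pairs the two files by matching equal ranks (smallest with smallest, and so on), and counts how many of the original pairs are recovered. The choice of the bivariate normal, the rank-matching rule, the seed, the sample sizes, and all names are assumptions made for illustration; the strategies actually studied are introduced later in the chapter.

```python
import math
import random

def broken_sample(n, rho, rng):
    """Draw n pairs (T_i, U_i) from a standard bivariate normal with
    correlation rho."""
    pairs = []
    for _ in range(n):
        t = rng.gauss(0.0, 1.0)
        u = rho * t + math.sqrt(1.0 - rho**2) * rng.gauss(0.0, 1.0)
        pairs.append((t, u))
    return pairs

def ranks(values):
    """Rank of each value within the list (1 = smallest); ties are not
    expected for continuous data."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    r = [0] * len(values)
    for rank, i in enumerate(order, start=1):
        r[i] = rank
    return r

def correct_rank_matches(pairs):
    """Number of i whose T-rank equals its U-rank, i.e. the number of
    original pairs recovered by matching equal ranks."""
    r1 = ranks([t for t, _ in pairs])
    r2 = ranks([u for _, u in pairs])
    return sum(1 for a, b in zip(r1, r2) if a == b)

rng = random.Random(0)          # fixed seed, for reproducibility only
n, reps = 20, 2000
for rho in (0.0, 0.5, 0.9):
    mean = sum(correct_rank_matches(broken_sample(n, rho, rng))
               for _ in range(reps)) / reps
    print(f"rho={rho}: average correct matches = {mean:.2f}")
```

With rho = 0 the two rank vectors behave like independent random permutations, so the expected number of correct matches is 1 whatever the value of n; the averages printed for larger rho give a rough feel for how dependence helps the re-pairing.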

This  chapter  will  be  organized  into  a  review  of  the  literature  on 
matching problems in Sections 2.3 to 2.5, followed by a discussion of
statistical  properties  of  some  matching  strategies  in  Sections  2.6  to 
2.9. 


2.2 Notations

In this section, we introduce most of the notation that will be
used  in  the  present  chapter. 


(1) $(\mathbf{T}, \mathbf{U})$ will denote a multivariate random vector. It is assumed to
have an absolutely continuous joint cumulative distribution function
(CDF) H(t, u) and joint density h(t, u); the context will make
the dimensions of t and u clear. In particular, $(T, U)$ will denote
a two-dimensional random vector, with h(t, u) and H(t, u) respectively
as the density and CDF of $(T, U)$. $h_1(\cdot)$ and $h_2(\cdot)$ will
respectively denote the marginal densities of T and U, and $F(\cdot)$,
$G(\cdot)$ will be the respective marginal distribution functions.
The symbol $g_{\xi}(\cdot)$ will be the generic notation for the density
function of the random vector $\xi$. Without the suffix, $g(\cdot)$ will
denote a real-valued function.

(2) Let $\binom{T_i}{U_i}$, $i = 1, 2, \ldots, n$, be a random sample from the population
of $\binom{T}{U}$. Let $F_n(x) = \frac{1}{n} \sum_{i=1}^{n} I(T_i \le x)$ be the empirical C.D.F.
based on the variables $T_1, \ldots, T_n$. Similarly, $G_n(x)$ will be
the empirical C.D.F. based on $U_1, \ldots, U_n$.

Let $R_{1i} = \sum_{\alpha=1}^{n} I(T_i \ge T_\alpha)$ be the rank of $T_i$ among the variables
$T_1, \ldots, T_n$, where $i = 1, 2, \ldots, n$. Similarly, $R_{21}, \ldots, R_{2n}$
will denote the rank order of the variables $U_1, \ldots, U_n$.

(3) Let $\varphi = (\varphi(1), \ldots, \varphi(n))$ be a permutation of the integers
$1, 2, \ldots, n$. $\Phi$ will stand for the set of all such permutations.
Also, let $\varphi^* = (1, 2, \ldots, n)$.

(4) Let $\varepsilon > 0$. For $i = 1, 2, \ldots, n$, define events $A_{ni}(\varphi,\varepsilon)$ as follows:

$$A_{ni}(\varphi,\varepsilon) = \left[\, |U_{(\varphi(R_{1i}))} - U_i| \le \varepsilon \,\right] \tag{2.2.1}$$

Let
$$A_{ni}(\varepsilon) = A_{ni}(\varphi^*,\varepsilon), \quad i = 1, 2, \ldots, n, \tag{2.2.2}$$

$$A_{ni} = A_{ni}(\varphi^*,0) = \{R_{1i} = R_{2i}\}, \quad i = 1, 2, \ldots, n. \tag{2.2.3}$$

Let
$$V_{ni}(\varphi,\varepsilon) = I_{A_{ni}(\varphi,\varepsilon)}, \quad i = 1, 2, \ldots, n, \tag{2.2.4}$$

$$V_{ni}(\varepsilon) = I_{A_{ni}(\varphi^*,\varepsilon)}, \quad i = 1, 2, \ldots, n, \tag{2.2.5}$$

$$V_{ni} = I_{A_{ni}}, \quad i = 1, 2, \ldots, n. \tag{2.2.6}$$

(5) Let $c(x,y)$ be the generic notation for a joint density of two
random variables $T$ and $U$ which are marginally uniform. Then,
define the constant $\lambda$ as $\int_0^1 c(x,x)\,dx$, which is the density of the
random variable $T-U$ evaluated at zero. For any fixed integer $d$,
define

$$S_n = (S_{n1}, \ldots, S_{nd}), \quad \text{where} \quad S_{nj} = R_{1j} - R_{2j}, \quad j = 1, 2, \ldots, d. \tag{2.2.7}$$

Note that if

$$\xi_{jk} = I(T_j - T_k > 0) - I(U_j - U_k > 0), \quad \forall\; 1 \le j \le d \text{ and } 1 \le k \le n, \tag{2.2.8}$$

then we get the representation

$$S_{nj} = \sum_{k=1}^{n} \xi_{jk}, \quad j = 1, \ldots, d. \tag{2.2.9}$$

Let
$$\xi_k = (\xi_{1k}, \ldots, \xi_{dk})'. \tag{2.2.10}$$

Then,
$$S_n = \sum_{k=1}^{n} \xi_k. \tag{2.2.11}$$

Let  ^ljk  1  ( Tj-T^>e  )  I(UJ-Uk>0)  *  1  5  k  1  " 


*2Jk  >0)  “  1  (T  T  >-c  J  1  -  *  -  n 

J  K  J  " 


(2.2.12) 


Let $L = T - U$ and $L_j = T_j - U_j$, where $j = 1, 2, \ldots$. Let $\mathcal{A}_d$ be
the sigma-field $\sigma(W_1, \ldots, W_d)$ generated by the vectors
$W_i = \binom{T_i}{U_i}$, $i = 1, \ldots, d$. Let $\Psi_{\eta}(\theta)$ be the generic notation for
the characteristic function of a random vector $\eta$, $\theta$ being a vector
of dummy variables whose dimension is the same as that of $\eta$.
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Let  L  (w  ,  ....  w  )  be  the  variable  £  ,  when  W  takes  the 

,  1 K  ~  1  "''u  j  K 


value  w^ ,  1  =  1,2,  ....  d. 

Let  F  (w  •  -  -  ,  w.)  and  S 
~a  ~i 


n 

i 

k-1 


*  V*1 . V  be 


respectively  and  when  =  w^ ,  1  =  1,2 . d. 

Let $\Psi_d = \Psi_d(w_1, \ldots, w_d)$ be the negative logarithm of the
modulus of the characteristic function of $\xi_{d+1}(w_1, \ldots, w_d)$.


2.3 Data-based Matching Strategies

Pairing the observations in the two data files that were described
in Section 2.1 should be distinguished from the problem of
matching two equivalent decks of $n$ distinct cards, which is discussed
in elementary textbooks such as Feller (1968). One version of card
matching is as follows. Consider a "target pack" of $n$ cards laid out
in a row and a "matching pack" of the same number of cards laid out
randomly one by one beside the target pack. In this random arrangement
of cards, $n$ pairs of cards are formed. A match or coincidence
is said to have occurred in a pair if the two cards in the pair are
identical. Because the two decks are merged purely by chance and
without using any type of observations or other information about the
cards, one may describe such problems as no-data matching problems.

An excellent survey of various versions of card-matching schemes is
found in Barton (1958).

Suppose  that  N  denotes  the  number  of  pairs  in  the  aforementioned 
matching  problem  which  have  like  cards  or  matches.  The  derivation  of 
the  probability  distribution  of  N  dates  back  to  Montmort  (1708).  The 


following  is  a  summary  of  some  of  the  well  known  properties  of  N 
(Feller  1968): 

Proposition 2.3.1:

(i) If $P_{[m]}$ is the probability of having exactly $m$ matches, then

$$P_{[m]} = \frac{1}{m!}\left[1 - \frac{1}{1!} + \frac{1}{2!} - \frac{1}{3!} + \cdots + \frac{(-1)^{n-m}}{(n-m)!}\right], \quad m = 0, 1, \ldots, n-1,$$

and $P_{[n]} = \frac{1}{n!}$.

(ii) Noting that $e^{-1}/m!$ is the probability that a Poisson random
variable with mean 1 takes the value $m$, we have the following
approximation for large $n$:

$$P_{[m]} \approx \frac{e^{-1}}{m!}.$$

(iii) For $d = 1, 2, \ldots, n$, the $d$th factorial moment of $N$, namely
$E(N^{(d)})$, is 1.
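The three statements of Proposition 2.3.1 can be checked numerically for small $n$. The following is a minimal sketch, not part of the original text: the function names (`match_pmf`, `brute_force_pmf`, `factorial_moment`) are hypothetical, and the brute-force enumeration is only practical for small decks.

```python
from fractions import Fraction
from itertools import permutations
from math import exp, factorial


def match_pmf(n):
    """Exact P[m], m = 0..n, from the truncated alternating series in part (i)."""
    pmf = []
    for m in range(n + 1):
        # 1 - 1/1! + 1/2! - ... + (-1)^(n-m)/(n-m)!
        series = sum(Fraction((-1) ** k, factorial(k)) for k in range(n - m + 1))
        pmf.append(series / factorial(m))
    return pmf


def brute_force_pmf(n):
    """Same distribution obtained by enumerating all n! arrangements (small n only)."""
    counts = [0] * (n + 1)
    for perm in permutations(range(n)):
        counts[sum(1 for i, j in enumerate(perm) if i == j)] += 1
    return [Fraction(c, factorial(n)) for c in counts]


def factorial_moment(pmf, d):
    """E[N(N-1)...(N-d+1)] for a pmf given as a list indexed by m."""
    total = Fraction(0)
    for m, p in enumerate(pmf):
        falling = 1
        for k in range(d):
            falling *= m - k
        total += falling * p
    return total


if __name__ == "__main__":
    n = 8
    for m, p in enumerate(match_pmf(n)):
        # Exact probability next to the Poisson(1) approximation of part (ii).
        print(m, float(p), exp(-1) / factorial(m))
```

Exact rational arithmetic is used here so that the factorial moments in part (iii) can be compared with 1 without rounding concerns.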

As one might expect, for certain broken random sample models, it
pays to match two files of data using optimal strategies based on
such data. Several authors, starting with DeGroot, Feder and Goel
(1971), have proposed and studied matching strategies based on broken
data. In Section 2.9, it will be shown that, for certain matching
strategies based on independent variables $T$ and $U$, the distributional
properties of the number of correct matches are the same as those
mentioned in Proposition 2.3.1. In other words, as far as statistical
properties of $N$ are concerned, matching files of data on independent
random variables is only as good as no-data matching, in which
we randomly assign units in one file to the units in the other file.


2.4 Repairing a Broken Random Sample


2.4.1  The  Basic  Matching  Problems 

Let us consider matching the broken random sample $x_1, x_2, \ldots,
x_n$; $y_1, \ldots, y_n$ by pairing $x_i$ with $y_{\varphi(i)}$, for $i = 1, 2, \ldots, n$, where
$\varphi = (\varphi(1), \ldots, \varphi(n))$ is a permutation of $1, 2, \ldots, n$. As we seek a
$\varphi$ from $\Phi$ that will provide reasonably good pairings of the $x$'s with
the $y$'s, we need to clarify the fundamental role of $\varphi$ in the statistical
model described in Section 2.1. If we treat $\varphi$ as an unknown
parameter of the model, then the likelihood of the data will include
$\varphi$. For instance, if $T$ and $U$ are jointly bivariate normal with means
$\mu_1$, $\mu_2$, variances $\sigma_1^2$, $\sigma_2^2$ and correlation coefficient $\rho$, then the
log-likelihood function of $\varphi, \rho, \mu_1, \mu_2, \sigma_1^2, \sigma_2^2$, given the broken
random sample, is

$$L(\varphi,\rho,\mu_1,\mu_2,\sigma_1^2,\sigma_2^2 \mid x_1, \ldots, x_n, y_1, \ldots, y_n) = -\frac{n}{2}\log(1-\rho^2) - \frac{n}{2}\log\sigma_1^2 - \frac{n}{2}\log\sigma_2^2 - \frac{1}{2(1-\rho^2)}\left[\sum_{i=1}^{n}\left(\frac{x_i-\mu_1}{\sigma_1}\right)^2 + \sum_{i=1}^{n}\left(\frac{y_i-\mu_2}{\sigma_2}\right)^2 - \frac{2\rho}{\sigma_1\sigma_2}\sum_{i=1}^{n}(x_i-\mu_1)(y_{\varphi(i)}-\mu_2)\right] \tag{2.4.1}$$


A constant term not involving the parameters has been omitted in
(2.4.1). In subsection 2.4.2, we shall seek $\varphi$'s that maximize a
likelihood such as this. On the other hand, some statisticians
would regard $\varphi$ as some sort of missing data and not as a parameter
of the underlying model. The problem of pairing the two files will
not arise in such situations. However, one may still want to do
statistical inference for other parameters of the model based on the
broken random sample. Such issues are not pursued in this thesis,
and one may refer to DeGroot and Goel (1980) for an approach to
estimating the correlation coefficient $\rho$ while treating $\varphi$ as
missing data in the bivariate normal model.

2.4.2  The  Maximum  Likelihood  Solution  to  the  Matching  Problem 

We start with a bivariate model used in DeGroot et al. (1971)
which assumes that the parent probability density function of $\binom{T}{U}$ is

$$h(t,u) = \alpha(t)\,\beta(u)\,\exp[\gamma(t)\,\delta(u)] \tag{2.4.2}$$

where $\alpha$, $\beta$, $\gamma$, $\delta$ are known but otherwise arbitrary real valued
functions of the indicated variables. Suppose now that $x_1, \ldots, x_n$
and $y_1, \ldots, y_n$ are the observations in a broken random sample from
a completely specified density of the form (2.4.2). If $x_i$ was paired
with $y_{\varphi(i)}$ for $i = 1, 2, \ldots, n$, in the original unbroken sample, then
the joint density of the broken sample would be

$$\prod_{i=1}^{n} h[x_i, y_{\varphi(i)}] = \left[\prod_{i=1}^{n}\alpha(x_i)\right]\left[\prod_{i=1}^{n}\beta(y_i)\right]\exp\left[\sum_{i=1}^{n}\gamma(x_i)\,\delta(y_{\varphi(i)})\right] \tag{2.4.3}$$

Thus the maximum likelihood estimate of the unknown permutation $\varphi$ is
the permutation for which $\sum_{i=1}^{n}\gamma(x_i)\,\delta(y_{\varphi(i)})$ is a maximum. Without
loss of generality, we shall assume that the $x_i$'s and $y_j$'s have been
reindexed so that $\gamma(x_1) \le \cdots \le \gamma(x_n)$ and $\delta(y_1) \le \cdots \le \delta(y_n)$.
Since $\binom{T}{U}$ is assumed to have an absolutely continuous distribution,
with probability one, there are no ties among the $\gamma(x_i)$'s or $\delta(y_j)$'s.
DeGroot et al. (1971) show that the maximum likelihood solution is
to pair $x_i$ with $y_i$, for $i = 1, \ldots, n$. In other words, the maximum
likelihood pairing (M.L.P.) is $\varphi^* = (1, \ldots, n)$.

In particular, if the density in (2.4.2) is that of a bivariate
normal random vector with correlation $\rho$, then the M.L.P. can be described
knowing only the sign of $\rho$. If $\rho > 0$, the M.L.P. is to order the
observed values so that $x_1 < \cdots < x_n$ and $y_1 < \cdots < y_n$ and then to
pair $x_i$ with $y_i$, for $i = 1, 2, \ldots, n$. If $\rho < 0$, the solution
is to pair $x_i$ and $y_{(n+1-i)}$, for $i = 1, 2, \ldots, n$. If $\rho = 0$, all
pairings, or permutations, are equally likely.
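The sort-and-pair rule for the bivariate normal case can be compared against an exhaustive search over permutations. The sketch below is illustrative only: the function names are hypothetical, permutations are 0-indexed lists with `phi[i]` the index of the $y$ paired with `x[i]`, and it relies on the observation from (2.4.1) that, up to terms not involving $\varphi$, maximizing the likelihood amounts to maximizing $\mathrm{sign}(\rho)\sum_i x_i y_{\varphi(i)}$.

```python
from itertools import permutations
import random


def cross_term(x, y, phi):
    """Sum of x[i] * y[phi[i]], the only part of (2.4.1) that depends on phi."""
    return sum(x[i] * y[phi[i]] for i in range(len(x)))


def mlp_by_sorting(x, y, rho_sign):
    """Pair sorted x's with sorted y's (y's reversed when rho < 0)."""
    x_order = sorted(range(len(x)), key=lambda i: x[i])
    y_order = sorted(range(len(y)), key=lambda j: y[j], reverse=(rho_sign < 0))
    phi = [0] * len(x)
    for i, j in zip(x_order, y_order):
        phi[i] = j  # x[i] is paired with y[j]
    return phi


def mlp_by_brute_force(x, y, rho_sign):
    """Search all n! permutations; feasible only for small n."""
    best = max(permutations(range(len(x))),
               key=lambda phi: rho_sign * cross_term(x, y, phi))
    return list(best)


if __name__ == "__main__":
    rng = random.Random(0)
    x = [rng.gauss(0, 1) for _ in range(6)]
    y = [rng.gauss(0, 1) for _ in range(6)]
    for sign in (+1, -1):
        print(sign, mlp_by_sorting(x, y, sign), mlp_by_brute_force(x, y, sign))
```

Sorting costs $O(n \log n)$, whereas the exhaustive search grows as $n!$; the latter is included only as a check on small samples.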

Chew (1973) derived the maximum likelihood solution to the
(bivariate) matching problem for a larger class of densities $h(t,u)$
with a monotone likelihood ratio. That is, for any values $t_1$, $t_2$,
$u_1$ and $u_2$ such that $t_1 < t_2$ and $u_1 < u_2$,

$$h(t_1,u_1)\,h(t_2,u_2) \ge h(t_1,u_2)\,h(t_2,u_1) \tag{2.4.4}$$

As before, we shall assume that the values $x_1, \ldots, x_n$ and
$y_1, \ldots, y_n$ in a broken random sample are from a density $h(t,u)$
satisfying (2.4.4). Without loss of generality, relabel the $x$'s and $y$'s so that
$x_1 < \cdots < x_n$ and $y_1 < \cdots < y_n$. Then the permutation $\varphi^* = (1, \ldots, n)$
is again the M.L.P.


DeGroot et al. (1971) studied the matching problem from a
Bayesian point of view as well. They proposed three optimality
criteria, subject to which one may choose the matching strategy $\varphi$.
Before we state these criteria, we need some notation and definitions.

Let $x_1, \ldots, x_n$ and $y_1, \ldots, y_n$ be the values of a broken
random sample from a given parent distribution with density $h(t,u)$.
If $x_i$ is paired with $y_{\varphi(i)}$, $i = 1, 2, \ldots, n$, then the likelihood
function of the unknown permutation $\varphi$ is given by the equation

$$L(\varphi) = \prod_{i=1}^{n} h(x_i, y_{\varphi(i)}). \tag{2.4.5}$$

Assume that the prior probability of each permutation is $\frac{1}{n!}$. Then
the posterior probability that $\varphi$ provides a completely correct set
of $n$ matches is

$$p(\varphi) = L(\varphi) \Big/ \sum_{\psi \in \Phi} L(\psi). \tag{2.4.6}$$

For $j = 1, 2, \ldots, n$, let

$$\Phi(j) = \{\varphi \in \Phi : \varphi(1) = j\} \tag{2.4.7}$$

be the set of $(n-1)!$ permutations which specify that $x_1$ is to be
paired with $y_j$. Using the definitions in (2.4.6) and (2.4.7), we get
the posterior probability that the pairing of $x_1$ and $y_j$ yields a
correct match to be

$$p_j = \sum_{\varphi \in \Phi(j)} p(\varphi), \quad 1 \le j \le n. \tag{2.4.8}$$

For any two permutations $\varphi$ and $\psi$ in $\Phi$, let

$$K(\varphi,\psi) = \#\{i : \varphi(i) = \psi(i)\}.$$

That is, $K(\varphi,\psi)$ is the number of correct matches when the observations
in the broken random sample are paired according to $\varphi$ and the
vectors in the original sample were actually paired according to $\psi$.

It then follows that for any permutation $\varphi \in \Phi$, the quantity

$$M(\varphi) = \sum_{\psi \in \Phi} K(\varphi,\psi)\, p(\psi) \tag{2.4.9}$$

is the posterior expected number of correct matches when $\varphi$ is used
to repair the data in the broken random sample.

Finally, let $\Phi_{1,n}$ be the set of all permutations $\varphi$ such that
$\varphi(1) = 1$ and $\varphi(n) = n$.

DeGroot, Feder and Goel (1971) have proposed three optimality
criteria, subject to which one may choose the matching strategy $\varphi$:

(i) maximize the probability, $p(\varphi)$, of a completely correct set of
$n$ matches,

(ii) maximize the probability, $p_j$, of correctly matching $x_1$ by
choosing an optimal $j$ from $\{1, 2, \ldots, n\}$, and

(iii) maximize the expected number, $M(\varphi)$, of correct matches in the
repaired sample.

Assuming that the bivariate density of $T$ and $U$ was given by
$h(t,u) = a(t)\,b(u)\,e^{tu}$, $(t,u) \in R^2$, the following results, among
others, were established by DeGroot et al. (1971):

(a) The M.L.P. $\varphi^*$ maximizes the probability of correct pairing of all
$n$ observations.

(b) The probability of pairing $x_1$ ($x_n$) correctly is maximized by
pairing $x_1$ ($x_n$) with $y_1$ ($y_n$).

(c) The class of permutations $\Phi_{1,n}$ is complete; that is, given any
permutation $\varphi \notin \Phi_{1,n}$, there exists a $\psi \in \Phi_{1,n}$ which is as good as
$\varphi$ in the sense that $M(\psi) \ge M(\varphi)$.

(d) Sufficient conditions in terms of the data $x_1, \ldots, x_n$ and $y_1,
\ldots, y_n$ for the M.L.P. $\varphi^*$ to maximize $M(\varphi)$ were also given.

The results in Chew (1973) and Goel (1975) are extensions of (a)
through (d) to an arbitrary bivariate density $h(t,u)$ possessing the
monotone likelihood ratio. The "completeness" property in (c) implies
that the permutation $\varphi^E$ maximizing $M(\varphi)$ satisfies $\varphi^E(1) = 1$ and
$\varphi^E(n) = n$. For $n = 2, 3$, $\varphi^* \equiv \varphi^E$. DeGroot et al. (1971) show that for
$n > 3$, $\varphi^E$ is not necessarily equal to the M.L.P. $\varphi^*$ by means of a
counterexample.
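For very small $n$ the quantities in (2.4.6), (2.4.8) and (2.4.9) can be computed by direct enumeration. The following is a rough sketch under stated assumptions: it uses the density $h(t,u) = a(t)b(u)e^{tu}$, for which the factors $a(\cdot)$ and $b(\cdot)$ cancel in the posterior; the function names are hypothetical; and permutations are 0-indexed tuples.

```python
from itertools import permutations
from math import exp


def posterior(x, y):
    """p(phi) of (2.4.6) under h(t,u) proportional to exp(t*u), uniform prior on phi."""
    n = len(x)
    perms = list(permutations(range(n)))
    weights = [exp(sum(x[i] * y[phi[i]] for i in range(n))) for phi in perms]
    total = sum(weights)
    return {phi: w / total for phi, w in zip(perms, weights)}


def prob_first_pair(post, j):
    """p_j of (2.4.8): posterior probability that x[0] belongs with y[j]."""
    return sum(p for phi, p in post.items() if phi[0] == j)


def expected_matches(post, phi):
    """M(phi) of (2.4.9): posterior expected number of correct matches."""
    return sum(sum(1 for a, b in zip(phi, psi) if a == b) * p
               for psi, p in post.items())


if __name__ == "__main__":
    # Illustrative data, already sorted so that the M.L.P. is the identity.
    x = [-1.2, -0.3, 0.4, 1.5]
    y = [-0.9, 0.1, 0.6, 1.1]
    post = posterior(x, y)
    print("argmax p(phi):", max(post, key=post.get))
    print("p_j:", [round(prob_first_pair(post, j), 4) for j in range(len(y))])
    print("argmax M(phi):", max(post, key=lambda phi: expected_matches(post, phi)))
```

Averaging $M(\varphi)$ over all $n!$ permutations gives 1, the expected number of matches under no-data matching, which is a convenient sanity check on the enumeration.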

2.4.4  Matching  Problems  for  Multivariate  Normal  Distributions 

In our review so far, we have discussed optimal matching
strategies only in the case of bivariate data, one variable for each
of the two files. However, multivariate data are often available in
both files. Suppose then that we have a model where $\binom{T}{U}$ has a $(p+q)$
dimensional normal distribution with known variance-covariance matrix
$\Sigma$. Let us write $\Sigma$ and its inverse in the following partitioned form:

$$\Sigma = \begin{pmatrix} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & \Sigma_{22} \end{pmatrix}, \qquad \Sigma^{-1} = \Omega = \begin{pmatrix} \Omega_{11} & \Omega_{12} \\ \Omega_{21} & \Omega_{22} \end{pmatrix},$$

where both $\Sigma_{12}$ and $\Omega_{12}$ have dimension $p \times q$.

As before, we shall let $x_1, \ldots, x_n$ and $y_1, \ldots, y_n$ denote the
values in a broken random sample from this distribution, where each
$x_i$ is a vector of dimension $p \times 1$ and each $y_j$ vector has the
dimension $q \times 1$. The results to be presented here were originally
described by DeGroot and Goel (1976).

The likelihood function $L$, as a function of the unknown permutation
$\varphi$, can be written in the form

$$L(\varphi) = \exp\left[-\sum_{i=1}^{n} x_i'\,\Omega_{12}\,y_{\varphi(i)}\right] \tag{2.4.10}$$

since the other factors in the joint density of the sample do not
depend on $\varphi$. If we again assume that the prior probability of each
permutation $\varphi$ is $\frac{1}{n!}$, then the posterior probability that $\varphi$ provides
a completely correct set of $n$ matches is given by (2.4.6). Thus,
maximizing $p(\varphi)$ is equivalent to maximizing $L(\varphi)$, or equivalently
minimizing

$$Q(\varphi) = \sum_{i=1}^{n} x_i'\,\Omega_{12}\,y_{\varphi(i)}. \tag{2.4.11}$$


There is no simple way, in general, to describe the maximum likelihood
solution.

However, if rank$(\Sigma_{12}) = 1$, then rank$(\Omega_{12}) = 1$ and $\Omega_{12}$ can be
represented in the form $\Omega_{12} = a\,b'$, where $a$ and $b$ are vectors of
dimensions $p \times 1$ and $q \times 1$. If we let $\gamma(x_i) = a'x_i$ and $\delta(y_i) = b'y_i$
for $i = 1, 2, \ldots, n$, then $\varphi^*$ will be the permutation that minimizes

$$Q(\varphi) = \sum_{i=1}^{n} \gamma(x_i)\,\delta(y_{\varphi(i)}). \tag{2.4.12}$$

Now, minimizing (2.4.12) is achieved by arranging the $\gamma(x_i)$'s from
smallest to largest, arranging the $\delta(y_i)$'s in the reverse order from
the largest to smallest, and then pairing the corresponding elements
in the two sequences.

Suppose next that rank$(\Sigma_{12}) > 1$. Without loss of generality,
we shall assume that $p \le q$ and let $v_j = \Omega_{12}\,y_j$ for $j = 1, \ldots, n$.
Then, both $x_i$ and $v_j$ are $p$-dimensional vectors, and the maximum likelihood
solution $\varphi^*$ will be the permutation that minimizes

$$Q(\varphi) = \sum_{i=1}^{n} x_i'\,v_{\varphi(i)}. \tag{2.4.14}$$

Let $D$ denote the $n \times n$ matrix $((d_{ij}))$ whose elements are $d_{ij} = x_i' v_j$.
Then minimizing (2.4.14) is equivalent to minimizing

$$Q(\varphi) = \sum_{i=1}^{n}\sum_{j=1}^{n} d_{ij}\,a_{ij}$$

subject to the constraints

$$\sum_{i=1}^{n} a_{ij} = 1, \quad \text{for } j = 1, 2, \ldots, n,$$

$$\sum_{j=1}^{n} a_{ij} = 1, \quad \text{for } i = 1, 2, \ldots, n,$$

$$a_{ij} = 0 \text{ or } 1,$$

which is a standard assignment problem with cost matrix $D$. Although
there is no simple form for the solution of an arbitrary assignment
problem of this type, efficient algorithms are available for finding
numerical solutions.
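The construction of the cost matrix $D$ and the resulting assignment problem can be sketched as follows. This is a minimal illustration, not the procedure of DeGroot and Goel (1976): the helper names are hypothetical, the matrix `omega12` and the data are made up, and the assignment problem is solved by exhaustive search, which is workable only for small $n$. For realistic sizes a Hungarian-type solver (for example `scipy.optimize.linear_sum_assignment`) would take the place of the brute-force step.

```python
from itertools import permutations


def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))


def mat_vec(m, v):
    """Multiply a p x q matrix (list of rows) by a q-vector."""
    return [dot(row, v) for row in m]


def cost_matrix(xs, ys, omega12):
    """d[i][j] = x_i' * Omega12 * y_j, computed through v_j = Omega12 * y_j."""
    vs = [mat_vec(omega12, y) for y in ys]
    return [[dot(x, v) for v in vs] for x in xs]


def solve_assignment_brute_force(d):
    """Return the 0-indexed permutation phi minimizing sum_i d[i][phi[i]]."""
    n = len(d)
    return min(permutations(range(n)),
               key=lambda phi: sum(d[i][phi[i]] for i in range(n)))


if __name__ == "__main__":
    # Rank-one illustration: Omega12 = a b' with a = (1, 2)', b = (1, -1)'.
    omega12 = [[1, -1], [2, -2]]
    xs = [(0, 1), (1, 0), (2, 2), (-1, 0)]
    ys = [(1, 0), (0, 2), (3, 1), (0, 0)]
    d = cost_matrix(xs, ys, omega12)
    print(solve_assignment_brute_force(d))
```

In this rank-one illustration the search should agree with the sorting rule described for (2.4.12): the smallest $a'x_i$ is paired with the largest $b'y_j$, and so on.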

The permutation $\varphi^E$ that maximizes the expected number of
correct matches is very difficult to calculate when $p$ and $n$ are
moderately large. No efficient algorithms are known. A Monte Carlo
study was reported by DeGroot and Goel (1976) in which they compare
$\varphi^E$ and $\varphi^*$ for $p = 2$ and 50 different covariance matrices $\Sigma$ with the
sample size $n = 3$, 4 and 5. In all cases, the proportion of samples
for which $\varphi^E$ and $\varphi^*$ were identical was between 0.92b and 0.995.
Thus, it is not unreasonable to use $\varphi^*$ even when the goal is to
maximize the expected number of correct matches.

DeGroot and Goel (1976) studied two other simple matching
strategies which provide good approximations to the M.L.P. $\varphi^*$ or to
the rule $\varphi^E$. We shall not discuss them here. In the rest of this
chapter, we shall discuss matching problems only in the bivariate case.

2.5 Reliability of Matching Strategies for Bivariate Data

Consider a random sample of size $n$, $\binom{T_1}{U_1}, \ldots, \binom{T_n}{U_n}$, from a
bivariate population with density $h(t,u)$. If the pairings in this
sample are lost before the entire data was recorded, we still can
observe the marginal order statistics. In fact, if $x_1, \ldots, x_n$ and
$y_1, \ldots, y_n$ is the broken random sample corresponding to the
unobserved sample on $\binom{T}{U}$, then clearly the order statistics
$X_{(1)} < \cdots < X_{(n)}$ of the $x$'s are exactly the same as the order
statistics $T_{(1)} < \cdots < T_{(n)}$ of the $T$'s. Similarly, the order statistics
$Y_{(1)} < Y_{(2)} < \cdots < Y_{(n)}$ are the same as $U_{(1)} < \cdots < U_{(n)}$. The
repairing of the $x$'s and $y$'s was introduced in Section 2.4. Thus
for each permutation $\varphi$ in $\Phi$, there is a matching strategy and the
typical merged file consists of the pairs

$$\binom{X_{(i)}}{Y_{(\varphi(i))}}, \quad i = 1, 2, \ldots, n. \tag{2.5.1}$$

Some optimal matching strategies were discussed in Section 2.4.
Here, we are concerned with the quality of the file in (2.5.1).

Ideally, we would like to choose a $\varphi$ for which the file in
(2.5.1) recovers all the $\binom{T}{U}$ pairs that we did not observe. It is
therefore natural to look at the random variable $N(\varphi)$, the number
of correct matches due to $\varphi$ or, equivalently, the number of
unobserved sample points which have been recovered in (2.5.1). It
should be pointed out that $M(\varphi)$, which was defined in Section 2.4.3,
is different from $E[N(\varphi)]$ because the former quantity is a posterior
expected value given a particular broken random sample and,
in the latter, the expectation is taken over all possible samples.

Situations often arise where it is not crucial that, after the
two files are matched, the matched pairs are exactly the same as the
pairs of the original data. For example, when contingency tables are
contemplated for grouped data on continuous variables $T$ and $U$, we
may, in the absence of the knowledge of the pairings, like to
reconstruct the pairs but would not worry too much as long as the
$U$-value in any matched pair came within a pre-fixed tolerance $\varepsilon$ (a
non-negative number) of the true $U$-value that we would get in the
ideal match of recovering all the original pairs. This type of
"approximate matching" was first introduced by Yahav (1982), who
defined $\varepsilon$-correct matching as follows:

Definition 2.5.1 (Yahav): A pair in the merged file (2.5.1),
$\binom{X_{(i)}}{Y_{(\varphi(i))}}$, say, is $\varepsilon$-correct if $|U_{(\varphi(i))} - U_{[i]}| \le \varepsilon$, where $\varepsilon \ge 0$
and $U_{[i]}$ is the concomitant of $X_{(i)}$; that is, the true $U$-value that
was paired with $X_{(i)}$ in the original sample.

The number of $\varepsilon$-correct matches, $N(\varphi,\varepsilon)$, in the merged file
(2.5.1) is given by

$$N(\varphi,\varepsilon) = \sum_{i=1}^{n} I\left[\,|U_{(\varphi(i))} - U_{[i]}| \le \varepsilon\,\right] \tag{2.5.2}$$

Note that as $\varepsilon \downarrow 0$, $N(\varphi,\varepsilon)$ converges (almost surely) to $N(\varphi,0)$, which
is a count of the exact (0-correct) matches. Hence $N(\varphi)$, the number
of correct matches due to $\varphi$, can be obtained from $N(\varphi,\varepsilon)$ by formally
letting $\varepsilon = 0$.
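The count in (2.5.2) can be computed directly when the original pairs are available, as in a simulation study. The sketch below is only an illustration with made-up data: the function name `epsilon_correct_count` is hypothetical, and `phi` is a 1-based permutation stored as a tuple, so that `phi[i - 1]` stands for $\varphi(i)$.

```python
def epsilon_correct_count(t, u, phi, eps=0.0):
    """N(phi, eps) of (2.5.2) for an unbroken sample (t[k], u[k]), k = 0..n-1."""
    n = len(t)
    by_t = sorted(range(n), key=lambda k: t[k])  # by_t[i] indexes T_(i+1)
    u_sorted = sorted(u)                         # U_(1) <= ... <= U_(n)
    count = 0
    for i in range(n):
        concomitant = u[by_t[i]]                 # U_[i+1], true partner of T_(i+1)
        matched = u_sorted[phi[i] - 1]           # U_(phi(i+1)), partner in merged file
        if abs(matched - concomitant) <= eps:
            count += 1
    return count


if __name__ == "__main__":
    t = [0.3, 1.2, -0.5, 2.0]
    u = [0.1, 1.0, 0.4, 1.6]
    identity = (1, 2, 3, 4)                      # the pairing phi* = (1, ..., n)
    print(epsilon_correct_count(t, u, identity, eps=0.0))
    print(epsilon_correct_count(t, u, identity, eps=0.35))
```

With these illustrative numbers the two largest $T$'s are matched exactly, while the two smallest are matched to $U$-values that are off by about 0.3, so they count only once the tolerance exceeds that gap.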

In the light of the definition of reliability of a merged file,
given in Section 1.7, the counts $N(\varphi)$ and $N(\varphi,\varepsilon)$ are useful indices
whose statistical properties reflect the reliability of the merged
file resulting from $\varphi$. We shall study these performance
characteristics in the following sections.


2.6 An Optimality Property of the Maximum
Likelihood Pairing $\varphi^*$

The known results about the optimality of the maximum likelihood
pairing $\varphi^* = (1, \ldots, n)$ with respect to some Bayesian criteria
were reviewed in Section 2.4. Here, we shall propose a new criterion
and establish that $\varphi^*$ is optimal with respect to that criterion.

Consider the random variable $N(\varphi)$, the number of correct
matches which result when a permutation $\varphi$ in $\Phi$ is used to merge
the broken random sample from a bivariate population. In this
section, we shall show that $\varphi^*$ maximizes $E(N(\varphi))$, the expected
number of correct matches, provided that the parent density $h(t,u)$
exhibits certain dependence structures.

We begin by quoting a very useful result on the exchangeability
of random variables from Randles and Wolfe (1979).

Lemma 2.6.1: If $\xi \stackrel{d}{=} \eta$ and $K(\cdot)$ is a measurable function (possibly
vector valued) defined on the common support of these random vectors,
then

$$K(\xi) \stackrel{d}{=} K(\eta).$$

We now establish a representation for $N(\varphi,\varepsilon)$ as a sum of
exchangeable Bernoulli random variables, which will be useful for
extending results of Yahav (1982).

Theorem 2.6.1: Let $N(\varphi,\varepsilon)$ and $V_{ni}(\varphi,\varepsilon)$ be as defined by (2.5.2) and
(2.2.4) respectively. Then

$$\forall\, \varphi \text{ in } \Phi, \quad N(\varphi,\varepsilon) = \sum_{i=1}^{n} V_{ni}(\varphi,\varepsilon), \tag{2.6.1}$$

where the summands are exchangeable random variables.


Proof: The order statistic $U_{(\varphi(i))}$ and the concomitant $U_{[i]}$ of $T_{(i)}$,
used in (2.5.2), can be written in terms of ranks of $T$'s and $U$'s as
follows.

$$U_{(\varphi(i))} = \sum_{\alpha=1}^{n} U_\alpha\, I(R_{2\alpha} = \varphi(i)) \tag{2.6.2}$$

$$U_{[i]} = \sum_{\alpha=1}^{n} U_\alpha\, I(R_{1\alpha} = i) \tag{2.6.3}$$

Note that $N(\varphi,\varepsilon)$ is simply a count of how many pairs in the merged
file due to $\varphi$, namely,

$$\binom{T_{(i)}}{U_{(\varphi(i))}}, \quad i = 1, 2, \ldots, n, \tag{2.6.4}$$

satisfy

$$|U_{(\varphi(i))} - U_{[i]}| \le \varepsilon. \tag{2.6.5}$$

If (2.6.5) holds for some $i$, then there exists a $j$ such that

$$|U_{(\varphi(R_{1j}))} - U_j| \le \varepsilon.$$

In view of the continuity of $(T_i, U_i)$, this correspondence is one to
one. Therefore, the count $N(\varphi,\varepsilon)$ must be the same as the count given
by

$$N(\varphi,\varepsilon) = \sum_{i=1}^{n} I\left(|U_{(\varphi(R_{1i}))} - U_i| \le \varepsilon\right). \tag{2.6.6}$$

Hence, (2.6.1) holds by virtue of the definition (2.2.4) of $V_{ni}$.

Towards showing the exchangeability of the $V_{ni}$'s, note that the
vectors $W_i = \binom{T_i}{U_i}$, $i = 1, \ldots, n$, of the original sample in (2.6.6)
are independent and identically distributed. Hence, using the
equal-in-distribution notation, we get

$$(W_{\alpha_1}, \ldots, W_{\alpha_n}) \stackrel{d}{=} (W_1, \ldots, W_n) \tag{2.6.7}$$

where $(\alpha_1, \ldots, \alpha_n)$ is an arbitrary permutation of $(1, 2, \ldots, n)$.

Define a function $f = (f_1, \ldots, f_n)$ from $R^{2n}$ to $R^n$ by the equations


\  1  lf  1  =  1  I(Vbl-(1  ~  tp(*^  1<a.-a.>0))  1  A  1  (b,  b,  >  c) 


1-1  J  1- 


ii  j  l 


L 


If  otherwise 


j  --  1,2 . n , 


(2.6.8) 


where $\varphi$ is the matching strategy we started with and $(a_1, b_1, \ldots,
a_n, b_n)$ is an arbitrary point in $R^{2n}$.

It follows from (2.6.7) and Lemma 2.6.1 that

$$f(W_{\alpha_1}, \ldots, W_{\alpha_n}) \stackrel{d}{=} f(W_1, \ldots, W_n). \tag{2.6.9}$$

Fix $j$ as an integer in $\{1, 2, \ldots, n\}$. Then, using (2.6.3) we see
that $f_j(W_{\alpha_1}, \ldots, W_{\alpha_n})$ is the indicator function of the event

n  n  n 


I  1 


U  >f  ) 


<  ). 

1  -1 


[(T  ,  T, >0)  } 
°d  1 


>; 

1  d 


(II 


.1  1 


« )  . 


or, equivalently, in terms of the ranks $R_{11}, \ldots, R_{1n}$ of the $T$'s and
the empirical C.D.F. of the $U$'s,

$$G_n(U_{\alpha_j} - \varepsilon) < \varphi(R_{1\alpha_j})/n \le G_n(U_{\alpha_j} + \varepsilon).$$

Observing that $G_n^{-1}(k/n) = U_{(k)}$, $k = 1, 2, \ldots, n$, we find that
$f_j(W_{\alpha_1}, \ldots, W_{\alpha_n})$ is 1 iff $|U_{(\varphi(R_{1\alpha_j}))} - U_{\alpha_j}| \le \varepsilon$. By the same token,
$f_j(W_1, \ldots, W_n)$ is the indicator of the event $|U_{(\varphi(R_{1j}))} - U_j| \le \varepsilon$,
so that $f_j(W_1, \ldots, W_n) = V_{nj}(\varphi,\varepsilon)$. From these facts and (2.6.9) it
follows that

$$(V_{n\alpha_1}(\varphi,\varepsilon), \ldots, V_{n\alpha_n}(\varphi,\varepsilon)) \stackrel{d}{=} (V_{n1}(\varphi,\varepsilon), \ldots, V_{nn}(\varphi,\varepsilon)). \tag{2.6.10}$$

Because $(\alpha_1, \ldots, \alpha_n)$ is an arbitrary permutation of $1, 2, \ldots, n$,
we conclude from (2.6.10) that the summands in (2.6.6) are exchangeable
random variables. □

Corollary 2.6.1: The number of correct matches resulting from the
matching strategy $\varphi$ has the representation

$$N(\varphi) = \sum_{i=1}^{n} I\left(R_{2i} = \varphi(R_{1i})\right). \tag{2.6.11}$$

Proof: Set $\varepsilon = 0$ in Theorem 2.6.1. □
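The rank representations (2.6.6) and (2.6.11) can be checked against the definition (2.5.2) on simulated data. The following is a minimal sketch with hypothetical function names; `phi` is a 1-based permutation stored as a list, ties are assumed absent (as they are with probability one under continuity), and the simulated normal data are purely illustrative.

```python
import random


def ranks(values):
    """1-based ranks: ranks(values)[k] is the rank of values[k] (no ties assumed)."""
    order = sorted(range(len(values)), key=lambda k: values[k])
    r = [0] * len(values)
    for position, k in enumerate(order, start=1):
        r[k] = position
    return r


def count_from_definition(t, u, phi, eps):
    """N(phi, eps) as in (2.5.2): loop over order statistics and concomitants."""
    by_t = sorted(range(len(t)), key=lambda k: t[k])
    u_sorted = sorted(u)
    return sum(1 for i in range(len(t))
               if abs(u_sorted[phi[i] - 1] - u[by_t[i]]) <= eps)


def count_from_ranks(t, u, phi, eps):
    """N(phi, eps) as in (2.6.6): loop over the original sample indices."""
    r1 = ranks(t)
    u_sorted = sorted(u)
    return sum(1 for i in range(len(t))
               if abs(u_sorted[phi[r1[i] - 1] - 1] - u[i]) <= eps)


def exact_matches_from_ranks(t, u, phi):
    """N(phi) as in (2.6.11): count indices with R_2i = phi(R_1i)."""
    r1, r2 = ranks(t), ranks(u)
    return sum(1 for i in range(len(t)) if r2[i] == phi[r1[i] - 1])


if __name__ == "__main__":
    rng = random.Random(3)
    t = [rng.gauss(0, 1) for _ in range(10)]
    u = [0.8 * ti + 0.6 * rng.gauss(0, 1) for ti in t]
    identity = list(range(1, 11))
    print(count_from_definition(t, u, identity, 0.5),
          count_from_ranks(t, u, identity, 0.5),
          exact_matches_from_ranks(t, u, identity))
```

Both counting functions evaluate the same absolute differences, only enumerated in a different order, which is the one-to-one correspondence used in the proof of Theorem 2.6.1.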

We will need the following special dependence structures for
the population density $h(t,u)$ (see Shaked 1979).

Definition (2.6.1): Exchangeable random variables $T, U$ are said to
be positive dependent by mixture (PDM) iff the joint distribution of
$T, U$ is that of $g(\xi_0, \xi_1)$ and $g(\xi_0, \xi_2)$, where $\xi_1$ and $\xi_2$ are i.i.d.
random variables, $\xi_0$ is a random vector which is independent of $\xi_1$
and $\xi_2$, and $g$ is a Borel measurable function.

Definition (2.6.2): Exchangeable random variables $T, U$ are said to
be positive dependent by expansion (PDE) iff the joint distribution
of $T$ and $U$ admits the following series expansion:

$$dH(t,u) = \left[1 + \sum_{i=1}^{\infty} a_i\,\eta_i(t)\,\eta_i(u)\right] dF(t)\,dF(u) \tag{2.6.12}$$

where $F(\cdot)$ is the marginal CDF of $T$ or $U$, the $a_i$'s are nonnegative real
numbers, and $\{\eta_i(\cdot)\}$ is a set of functions satisfying

$$\int_{-\infty}^{\infty} \eta_i(x)\,dF(x) = 0, \quad i = 1, 2, \ldots \tag{2.6.13}$$

According to Definitions 2.6.1 and 2.6.2, the dependence
concepts will apply only to pairs of exchangeable random variables.

It may also be noted that for most of the known expansions of PDE
distributions, the set of functions $\{\eta_i(\cdot)\}$ satisfies, in addition to
(2.6.13), the orthogonality conditions

$$\int_{-\infty}^{\infty} \eta_k(x)\,\eta_l(x)\,dF(x) = \delta_{kl}, \tag{2.6.14}$$

where $k, l = 1, 2, \ldots$ and $\delta_{kl}$ is the Kronecker delta.

We now give two examples to illustrate these concepts of
dependence.

Example 2.6.1: Let $\xi_0$, $\xi_1$, $\xi_2$ be i.i.d. standard normal random
variables. Let $\rho$ be any constant in the interval $[0,1]$. Define new
random variables

$$T = \sqrt{1-\rho}\;\xi_1 + \sqrt{\rho}\;\xi_0,$$

$$U = \sqrt{1-\rho}\;\xi_2 + \sqrt{\rho}\;\xi_0.$$

Then, it is easy to verify that $T, U$ are jointly normal and that
Definition (2.6.1) can be applied to $T$ and $U$ with the above choice
of $\xi_0$, $\xi_1$ and $\xi_2$. Hence, the standard bivariate normal distribution
with nonnegative correlation has the PDM property.

Also, Mardia (1970, p. 48) gives the following series expansion
for the bivariate normal density

$$h(t,u) = \left[1 + \sum_{k=1}^{\infty} \rho^k\,\eta_k(t)\,\eta_k(u)\right] f(t)\,f(u), \tag{2.6.15}$$

where $f(t)$ is the density of the univariate standard normal random
variable and $\{\eta_k(\cdot)\}$ is a set of orthonormal Hermite polynomials.
Thus, if $\rho \ge 0$, bivariate normal distributions possess the PDE
property as well.

Example 2.6.2: A class of bivariate densities due to Farlie, Gumbel
and Morgenstern is given by the formula

$$h(t,u) = 1 + \alpha(1-2t)(1-2u), \quad \text{where } 0 < t, u < 1, \quad -1 \le \alpha \le 1. \tag{2.6.16}$$

It is easy to check that $T$ and $U$ are PDE for $\alpha \ge 0$ in (2.6.16).
Note that the expansion (2.6.16) has only a finite number of terms,
unlike the expansion for the bivariate normal distribution.
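The PDE claim in Example 2.6.2 can be illustrated numerically. The sketch below rests on an assumption that is not spelled out in the text: that the expansion (2.6.12) for the density (2.6.16) uses the single function $\eta_1(x) = \sqrt{3}\,(1-2x)$ with coefficient $a_1 = \alpha/3$, the marginals being uniform on $(0,1)$. The function names are hypothetical and the integrals are approximated by a simple midpoint rule.

```python
from math import isclose, sqrt


def eta1(x):
    """Assumed expansion function: sqrt(3) * (1 - 2x) on (0, 1)."""
    return sqrt(3.0) * (1.0 - 2.0 * x)


def fgm_density(t, u, alpha):
    """Farlie-Gumbel-Morgenstern density as written in (2.6.16)."""
    return 1.0 + alpha * (1.0 - 2.0 * t) * (1.0 - 2.0 * u)


def pde_form(t, u, alpha):
    """One-term expansion 1 + a_1 * eta1(t) * eta1(u) with a_1 = alpha / 3."""
    return 1.0 + (alpha / 3.0) * eta1(t) * eta1(u)


def midpoint_integral(f, m=2000):
    """Midpoint-rule approximation of the integral of f over (0, 1)."""
    h = 1.0 / m
    return sum(f((k + 0.5) * h) for k in range(m)) * h


if __name__ == "__main__":
    print(midpoint_integral(eta1))                        # condition (2.6.13)
    print(midpoint_integral(lambda x: eta1(x) ** 2))      # condition (2.6.14), k = l = 1
    print(fgm_density(0.2, 0.7, 0.5), pde_form(0.2, 0.7, 0.5))
```

Under this assumed choice, $a_1 = \alpha/3$ is nonnegative exactly when $\alpha \ge 0$, which matches the range stated in the example.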
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We  now  prove  that  the  PDM/PDE  structures  are  Inherited  by  a  pair 
of new variables obtained from a given sample by computing the same
function of the marginals. These results are generalizations of
theorems in Shaked (1979), which were proved only for $n = 2$. However,
mathematical induction does not help to show the results for an
arbitrary $n$.

Theorem 2.6.2: Let $\binom{T_i}{U_i}$, $i = 1, 2, \ldots, n$ be a random sample from a PDM parent with density $h(t,u)$. Then, for any measurable function $g: R^n \to R$, the random variables $g(T_1, T_2, \ldots, T_n)$ and $g(U_1, U_2, \ldots, U_n)$ are jointly PDM.

Proof: By hypothesis, the vectors $\binom{T_i}{U_i}$ are i.i.d. Furthermore, since the PDM property is defined only for exchangeable pairs of random variables, we have

\[
(T_i, U_i) \stackrel{d}{=} (U_i, T_i), \qquad i = 1, 2, \ldots, n. \tag{2.6.17}
\]

Equation (2.6.17) together with the independence of the (T,U) pairs yields

\[
(T_1, \ldots, T_n;\ U_1, \ldots, U_n) \stackrel{d}{=} (U_1, \ldots, U_n;\ T_1, \ldots, T_n). \tag{2.6.18}
\]

Consider the function $K: R^{2n} \to R^2$ defined by the equation

\[
K(a_1, \ldots, a_n;\ b_1, \ldots, b_n) = \big(g(a_1, \ldots, a_n),\ g(b_1, \ldots, b_n)\big),
\]

where $(a_1, \ldots, a_n, b_1, \ldots, b_n)$ is any point in $R^{2n}$. Applying the function K to both sides of (2.6.18) and invoking Lemma 2.6.1 we get

\[
\big(g(T_1, \ldots, T_n),\ g(U_1, \ldots, U_n)\big) \stackrel{d}{=} \big(g(U_1, \ldots, U_n),\ g(T_1, \ldots, T_n)\big). \tag{2.6.19}
\]

Hence, (g(T), g(U)) is an exchangeable pair of random variables.
The PDM property of $(T_i, U_i)$, $i = 1, 2, \ldots, n$ further implies that there exist n i.i.d. vectors $(\xi_{0i}, \xi_{1i}, \xi_{2i})$, $i = 1, 2, \ldots, n$ and a measurable function f such that

(I) For each j, $\xi_{1j}$ and $\xi_{2j}$ are i.i.d. univariate random variables and the vector $\xi_{0j}$ is independent of $\xi_{1j}$ and $\xi_{2j}$.

(II) For each j,

\[
T_j = f(\xi_{1j}, \xi_{0j}) \quad \text{and} \quad U_j = f(\xi_{2j}, \xi_{0j}). \tag{2.6.20}
\]

Introducing the random variables

\[
\xi_1^* = \xi_{11}, \qquad \xi_2^* = \xi_{21}, \qquad
\xi_0^* = (\xi_{12}, \ldots, \xi_{1n},\ \xi_{22}, \ldots, \xi_{2n},\ \xi_{01}, \ldots, \xi_{0n}), \tag{2.6.21}
\]

we find that $\xi_1^*$ and $\xi_2^*$ are i.i.d. univariate random variables and $\xi_0^*$ is independent of $\xi_1^*$ and $\xi_2^*$, in view of the assumptions (I) and (II). Note that (2.6.20) and (2.6.21) imply that

\[
g(T) = g\big(f(\xi_{11}, \xi_{01}), \ldots, f(\xi_{1n}, \xi_{0n})\big)
\]

is a measurable function $g^*$, say, of $\xi_1^*$ and $\xi_0^*$. Similarly, g(U) is also the same function $g^*$ of the random variables $\xi_2^*$ and $\xi_0^*$. Hence, by definition, g(T) and g(U) are PDM.

The next theorem is similar to Theorem 2.6.2 except the parent distribution has the PDE property.

Theorem 2.6.3: Let $\binom{T_i}{U_i}$, $i = 1, 2, \ldots, n$ be a random sample from a PDE parent. Then, for any measurable function $g: R^n \to R$, the random variables $g(T_1, \ldots, T_n)$ and $g(U_1, \ldots, U_n)$ are PDE.

Proof:  The  exchangeability  of  the  joint  distribution  of  g(T)  and 

g(U)  has  already  been  proved  in  Theorem  2.6.2  (see  equation  2.6.19). 

It  remains  to  be  shown  that,  when  the  joint  density  of  each  of  the  n 
copies  of  T,U  admits  an  expansion  of  the  type  2.6.12,  the  joint 
density  of  g(T)  and  g(U)  also  admits  a  similar  expansion. 

Assume therefore that there exist nonnegative constants $\{a_k\}$ and a set of orthonormal functions $\{\eta_k\}$ such that the joint density of $T_i$ and $U_i$ is of the form

\[
dH(t_i, u_i) = dF(t_i)\,dF(u_i)\Big[1 + \sum_{k=1}^{\infty} a_k\,\eta_k(t_i)\,\eta_k(u_i)\Big], \tag{2.6.22}
\]

where $i = 1, 2, \ldots, n$. For any real x, define the measurable set in $R^n$

\[
A(x) = \{(x_1, \ldots, x_n):\ g(x_1, \ldots, x_n) \le x\}.
\]


Then, the distribution function Q, say, of (g(T), g(U)) is

\[
Q(x,y) = \int \cdots \int_{t \in A(x)} \int \cdots \int_{u \in A(y)} \prod_{j=1}^{n} dH(t_j, u_j). \tag{2.6.23}
\]

Using the expansions in equation (2.6.22) we get


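\[
Q(x,y) = Q(x)Q(y)
 + n \sum_{k=1}^{\infty} a_k\,\chi_k^{(1)}(x)\,\chi_k^{(1)}(y)
 + \binom{n}{2} \sum_{k_1=1}^{\infty} \sum_{k_2=1}^{\infty} a_{k_1} a_{k_2}\,\chi_{k_1,k_2}^{(2)}(x)\,\chi_{k_1,k_2}^{(2)}(y)
 + \cdots
 + \sum_{k_1=1}^{\infty} \cdots \sum_{k_n=1}^{\infty} a_{k_1} \cdots a_{k_n}\,\chi_{k_1,\ldots,k_n}^{(n)}(x)\,\chi_{k_1,\ldots,k_n}^{(n)}(y), \tag{2.6.24}
\]

where

\[
Q(x) = \int \cdots \int_{A(x)} \prod_{i=1}^{n} dF(t_i), \qquad
\chi_k^{(1)}(x) = \int \cdots \int_{A(x)} \eta_k(t_1) \prod_{i=1}^{n} dF(t_i),
\]

\[
\chi_{k_1,k_2}^{(2)}(x) = \int \cdots \int_{A(x)} \eta_{k_1}(t_1)\,\eta_{k_2}(t_2) \prod_{i=1}^{n} dF(t_i),
\]

and

\[
\chi_{k_1,\ldots,k_n}^{(n)}(x) = \int \cdots \int_{A(x)} \prod_{i=1}^{n} \eta_{k_i}(t_i) \prod_{i=1}^{n} dF(t_i). \tag{2.6.25}
\]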


Note that for all $k_1, \ldots, k_\ell = 1, 2, \ldots$ and for all $\ell = 1, 2, \ldots, n$ the signed measure induced by $\chi_{k_1,\ldots,k_\ell}^{(\ell)}(x)$ is absolutely continuous with respect to Q, so that there exists $\psi_{k_1,\ldots,k_\ell}^{(\ell)}(x)$, the Radon-Nikodym derivative of $\chi_{k_1,\ldots,k_\ell}^{(\ell)}(x)$ with respect to Q, such that

\[
\chi_{k_1,\ldots,k_\ell}^{(\ell)}(x) = \int_{-\infty}^{x} \psi_{k_1,\ldots,k_\ell}^{(\ell)}(t)\,dQ(t). \tag{2.6.26}
\]


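Hence, from equations (2.6.24) to (2.6.26) we get

\[
dQ(x,y) = dQ(x)\,dQ(y)\Big[1
 + n \sum_{k=1}^{\infty} a_k\,\psi_k^{(1)}(x)\,\psi_k^{(1)}(y)
 + \binom{n}{2} \sum_{k_1=1}^{\infty} \sum_{k_2=1}^{\infty} a_{k_1} a_{k_2}\,\psi_{k_1,k_2}^{(2)}(x)\,\psi_{k_1,k_2}^{(2)}(y)
 + \cdots
 + \sum_{k_1=1}^{\infty} \cdots \sum_{k_n=1}^{\infty} a_{k_1} \cdots a_{k_n}\,\psi_{k_1,\ldots,k_n}^{(n)}(x)\,\psi_{k_1,\ldots,k_n}^{(n)}(y)\Big]. \tag{2.6.27}
\]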

Representation (2.6.27) holds almost everywhere (Q measure) because Radon-Nikodym derivatives are defined up to sets of measure zero. Also, the coefficients in (2.6.27), being products of the nonnegative $a_k$'s, are themselves nonnegative. Hence, to complete the proof we only have to show that the orthogonality conditions (2.6.13) hold for the $\psi$'s of the expansion in (2.6.27).

For $\ell = 1, 2, \ldots, n$, and $1 \le k_1, \ldots, k_\ell < \infty$, we have

\[
\int_{-\infty}^{\infty} \psi_{k_1,\ldots,k_\ell}^{(\ell)}(t)\,dQ(t)
 = \lim_{x \to +\infty} \chi_{k_1,\ldots,k_\ell}^{(\ell)}(x)
 = \int_{-\infty}^{\infty} \cdots \int_{-\infty}^{\infty} \prod_{i=1}^{\ell} \eta_{k_i}(t_i) \prod_{i=1}^{n} dF(t_i)
\]

\[
 = \Big[\int_{-\infty}^{\infty} \eta_{k_1}(t_1)\,dF(t_1)\Big]
   \Big[\int_{-\infty}^{\infty} \cdots \int_{-\infty}^{\infty} \prod_{i=2}^{\ell} \eta_{k_i}(t_i) \prod_{i=2}^{n} dF(t_i)\Big].
\]

By hypothesis $\{\eta_k(\cdot)\}$ are a set of orthonormal functions on the marginal distribution $F(\cdot)$ of T so that

\[
\int_{-\infty}^{\infty} \eta_{k_1}(t_1)\,dF(t_1) = 0. \tag{2.6.28}
\]

Hence,

\[
\int_{-\infty}^{\infty} \psi_{k_1,\ldots,k_\ell}^{(\ell)}(t)\,dQ(t) = 0, \tag{2.6.29}
\]

where $\ell = 1, 2, \ldots, n$, and this completes the proof. □

The following facts about bivariate ranks are easy consequences of Theorems 2.6.2 and 2.6.3.

Corollary 2.6.1: Let $\binom{T_i}{U_i}$, $i = 1, 2, \ldots, n$ be a random sample from a PDM (PDE) parent. Consider the marginal ranks

\[
R_{1i} = \sum_{\alpha=1}^{n} I(T_i \ge T_\alpha) \quad \text{and} \quad R_{2i} = \sum_{\alpha=1}^{n} I(U_i \ge U_\alpha)
\]

of $T_i$ and $U_i$ respectively, where $i = 1, 2, \ldots, n$. The pair $\binom{R_{1i}}{R_{2i}}$ is PDM (PDE), $i = 1, 2, \ldots, n$.

Proof: Fix i and define a function $g_i: R^n \to R$ by the equation

\[
g_i(a_1, \ldots, a_n) = \sum_{\alpha=1}^{n} I(a_i \ge a_\alpha)
\]

and observe that

\[
R_{1i} = g_i(T_1, \ldots, T_n), \qquad R_{2i} = g_i(U_1, \ldots, U_n).
\]

By invoking Theorems 2.6.2 and 2.6.3, the result follows. □
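As an illustration of the rank function $g_i$ used in this proof, the following sketch computes the marginal ranks of a small sample. The helper name `marginal_ranks` and the sample values are made up for the example; they do not come from the original text.

```python
def marginal_ranks(values):
    """Rank of each entry: R_i = number of alpha with values[i] >= values[alpha]."""
    return [sum(1 for a in values if v >= a) for v in values]

# Hypothetical sample of four (T, U) pairs.
t = [0.3, -1.2, 2.5, 0.9]
u = [0.1, -0.4, 1.7, 2.2]

print(marginal_ranks(t))  # [2, 1, 4, 3]
print(marginal_ranks(u))  # [2, 1, 3, 4]
```

The same function is applied to the T's and to the U's, which is exactly the situation covered by Theorems 2.6.2 and 2.6.3.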

We need one more result before we establish an optimality property of $\varphi^*$.

Theorem 2.6.4: Let random vectors $\binom{T_i}{U_i}$, $i = 1, 2, \ldots, n$, be PDM/PDE and denote the ranks of $T_1$, $U_1$ among the $T_i$'s and $U_i$'s by $R_{11}$, $R_{21}$ respectively. Consider the joint probability mass function

\[
\pi_{ij} = P(R_{11} = i,\ R_{21} = j), \qquad 1 \le i, j \le n
\]

of $R_{11}$ and $R_{21}$. Then, the $\pi_{ij}$'s satisfy the following inequalities:

\[
2\pi_{ij} \le \pi_{ii} + \pi_{jj} \qquad \forall\ i, j. \tag{2.6.30}
\]


Proof: By hypothesis, the parent distribution is PDM or PDE. According to Corollary 2.6.1, $R_{11}$ and $R_{21}$ are also PDM or PDE. Consequently, $R_{11}$ and $R_{21}$ are exchangeable random variables. Hence,

\[
\pi_{ij} = \pi_{ji}, \qquad \text{for } 1 \le i, j \le n. \tag{2.6.31}
\]

To establish (2.6.30), first consider the case when T and U are PDM. By Theorem 2.6.2, $R_{11}$ and $R_{21}$ are PDM. Hence, there exists a distribution function $Q(\cdot)$, say, such that

\[
\pi_{ij} = \int_{-\infty}^{\infty} \pi_{i\cdot}(t)\,\pi_{\cdot j}(t)\,dQ(t), \qquad 1 \le i, j \le n, \tag{2.6.32}
\]

where $\pi_{i\cdot}(t)$ and $\pi_{\cdot j}(t)$ are the conditional mass functions of $R_{11}$ and $R_{21}$, given a value t from the Q-distribution.

It follows from equation (2.6.32) that

\[
\pi_{ii} + \pi_{jj} - 2\pi_{ij}
 = \int_{-\infty}^{\infty} \big[(\pi_{i\cdot}(t))^2 + (\pi_{\cdot j}(t))^2 - 2\pi_{i\cdot}(t)\,\pi_{\cdot j}(t)\big]\,dQ(t)
 = \int_{-\infty}^{\infty} \big(\pi_{i\cdot}(t) - \pi_{\cdot j}(t)\big)^2\,dQ(t)
 \ge 0.
\]

We thus obtain (2.6.30) when T, U are PDM. Suppose now that T and U are PDE. Then, by virtue of Corollary 2.6.1, $R_{11}$ and $R_{21}$ would be PDE. $R_{11}$ and $R_{21}$ are ranks that are based on independent random variables; hence, $R_{11}$ and $R_{21}$ are both discrete uniform random variables on 1, 2, ..., n (see Randles and Wolfe (1979), p. 38).

As $R_{11}$ and $R_{21}$ have finite supports, the series expansion of the joint mass function of $R_{11}$ and $R_{21}$ will have a finite number of terms. In fact, Fisher's identity (see Lancaster (1969), p. 90) holds:

\[
\pi_{ij} = \frac{1}{n^2}\Big[1 + \sum_{k=1}^{n-1} a_k\,\eta_k(i)\,\eta_k(j)\Big], \qquad 1 \le i, j \le n, \tag{2.6.33}
\]

where $\{a_k\}$ are nonnegative constants and $\{\eta_k\}$ are orthogonal functions on 1, 2, ..., n. The representation (2.6.33) leads to the following reasoning:

For $1 \le i, j \le n$,

\[
\pi_{ii} + \pi_{jj} - 2\pi_{ij}
 = \frac{1}{n^2}\Big[1 + \sum_{k=1}^{n-1} a_k (\eta_k(i))^2 + 1 + \sum_{k=1}^{n-1} a_k (\eta_k(j))^2 - 2 - 2\sum_{k=1}^{n-1} a_k\,\eta_k(i)\,\eta_k(j)\Big]
 = \frac{1}{n^2} \sum_{k=1}^{n-1} a_k \big[\eta_k(i) - \eta_k(j)\big]^2
 \ge 0. \tag{2.6.34}
\]

Hence, we obtain the inequalities in (2.6.30). □

An optimality property of $\varphi^*$ can now be established:

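Theorem 2.6.5: Let $\binom{T_i}{U_i}$, $i = 1, 2, \ldots, n$ be as in Theorem 2.6.4. Then, for all $\varphi \in \Phi$,

\[
E(N(\varphi)) \le E(N(\varphi^*)). \tag{2.6.35}
\]

Proof: In Corollary 2.6.1, $N(\varphi)$ was written as a sum of exchangeable indicator random variables. Hence, using equation (2.6.11), we get

\[
E(N(\varphi)) = n\,P(R_{21} = \varphi(R_{11}))
 = n \sum_{k=1}^{n} P(R_{21} = \varphi(k),\ R_{11} = k)
 = n \sum_{k=1}^{n} \pi_{k,\varphi(k)}, \tag{2.6.36}
\]

where $\pi$ is the joint mass function of $R_{11}$, $R_{21}$. Invoking the inequalities on $\pi$ in (2.6.30) we obtain

\[
E(N(\varphi)) \le n \sum_{k=1}^{n} \frac{\pi_{k,k} + \pi_{\varphi(k),\varphi(k)}}{2}
 = n\Big[\frac{1}{2}\sum_{k=1}^{n} \pi_{k,k} + \frac{1}{2}\sum_{k=1}^{n} \pi_{\varphi(k),\varphi(k)}\Big]
 = n\,P(R_{21} = R_{11})
 = E(N(\varphi^*)),
\]

which establishes the desired result. □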


To interpret Theorem 2.6.5, we first recall from subsection 2.4.2 that $\varphi^* = (1, 2, \ldots, n)$ is the M.L.P if the parent density has the monotone likelihood ratio (MLR) property. As demonstrated by Shaked (1979), there is no general relationship between the PDM/PDE concepts of positive dependence and the MLR property. We can therefore state the optimality of $\varphi^*$ in Theorem 2.6.5 as below:

Let T, U have a joint density that has the MLR property. In addition, let T and U be either PDM or PDE random variables. Let $x_1, \ldots, x_n$, $y_1, \ldots, y_n$ be a broken random sample from the T, U population. Then the M.L.P $\varphi^*$ is an optimal strategy to match the x's with the y's in the sense of maximizing the expected number of correct matches.

2.7 Monotonicity of $E(N(\varphi^*))$ with Respect to Dependence Parameters

Repairing of broken random samples based on the available data in two files was discussed in Section 2.4. It was observed that data-based optimal matching strategies exist when data come from populations having certain types of positive dependence structures.

It is therefore reasonable to expect an optimal matching strategy to perform better when there is some kind of positive dependence in the population than when the data in the two files are stochastically independent. Our objective in this section is to present a precise account of such intuitive results with regard to the maximum likelihood pairing $\varphi^*$. To this end, we will draw upon the results of Section 2.6. We begin with a definition from Shaked (1979):

Definition 2.7.1: Let J be a subset of R. A kernel K defined on J×J is said to be conditionally positive definite (c.p.d.) on J×J iff

(I) $K(x,y) = K(y,x)$ for all $x, y \in J$; that is, K is a symmetric kernel.

(II) Let m be any positive integer. For arbitrary real numbers $a_1, \ldots, a_m$ and for every choice of distinct numbers $x_1, \ldots, x_m$ from J, it holds that

\[
\sum_{i=1}^{m} \sum_{j=1}^{m} K(x_i, x_j)\,a_i a_j \ge 0 \quad \text{whenever} \quad \sum_{i=1}^{m} a_i = 0. \tag{2.7.1}
\]


It is pertinent to note that this definition is related to the well known concept of a positive definite kernel, which is used in, among others, the theory of characteristic functions. The nonnegativity of the quadratic form $\sum_{i=1}^{m} \sum_{j=1}^{m} K(x_i, x_j)\,a_i a_j$ without requiring the condition $\sum_{i=1}^{m} a_i = 0$ in (2.7.1) is a standard way of defining positive definite kernels (Widder, 1941, p. 271). We shall now give an example of a c.p.d. kernel which will be used in the sequel.

Example 2.7.1: Let $J = \{1, 2, \ldots, n\}$, where n is a fixed positive integer. To verify that the kernel $K(x,y) = I(x = y)$ is conditionally positive definite on J×J, let m be a positive integer. For arbitrary real numbers $a_1, \ldots, a_m$ and for every choice of distinct integers $i_1, \ldots, i_m$ from J, we have

\[
\sum_{\alpha=1}^{m} \sum_{\beta=1}^{m} K(i_\alpha, i_\beta)\,a_\alpha a_\beta
 = \sum_{\alpha=1}^{m} \sum_{\beta=1}^{m} I(i_\alpha = i_\beta)\,a_\alpha a_\beta
 = \sum_{\alpha=1}^{m} a_\alpha^2
 \ge 0, \tag{2.7.2}
\]

where we have used the fact that, in view of the integers $i_1, \ldots, i_m$ being distinct, $i_\alpha = i_\beta$ iff $\alpha = \beta$.

Note that we did not have to impose the condition $\sum_{i=1}^{m} a_i = 0$ to arrive at (2.7.2). Also, the function $I(x = y)$ is clearly symmetric in x and y. Hence, it follows from (2.7.2) that K(x,y) is positive definite and, consequently, is also c.p.d.
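The computation in Example 2.7.1 can be spot-checked numerically. The sketch below is only an illustration with hypothetical helper names (`indicator_kernel`, `quadratic_form`): for random choices of distinct integers and arbitrary real coefficients, it confirms that the quadratic form of the indicator kernel equals the sum of squared coefficients and is therefore nonnegative.

```python
import itertools
import random

def indicator_kernel(x, y):
    """K(x, y) = I(x = y)."""
    return 1.0 if x == y else 0.0

def quadratic_form(kernel, points, coeffs):
    """Sum over all (alpha, beta) of K(x_alpha, x_beta) * a_alpha * a_beta."""
    pairs = list(zip(points, coeffs))
    return sum(kernel(x, y) * a * b
               for (x, a), (y, b) in itertools.product(pairs, repeat=2))

rng = random.Random(1)
n = 8
worst = float("inf")
for _ in range(500):
    m = rng.randint(1, n)
    points = rng.sample(range(1, n + 1), m)        # distinct integers from J
    coeffs = [rng.uniform(-5, 5) for _ in range(m)]  # no sum-to-zero constraint
    q = quadratic_form(indicator_kernel, points, coeffs)
    # With distinct points only the diagonal terms survive.
    assert abs(q - sum(a * a for a in coeffs)) < 1e-9
    worst = min(worst, q)

print(worst >= 0.0)  # True
```

No constraint on the sum of the coefficients is imposed in the check, which mirrors the remark that the kernel is positive definite and not merely c.p.d.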

We will need the following lemma.

Lemma 2.7.1 (Shaked, 1979): Let T and U be PDM or PDE random variables with joint distribution function H(t,u). Letting $F(\cdot)$ stand for the common marginal distribution of T and U, define $H_0(t,u) = F(t) \cdot F(u)$, the distribution function of T and U in the case of independence of the variables. Then we have the ordering

\[
E_H(K(T,U)) \ge E_{H_0}(K(T,U)) \tag{2.7.3}
\]

iff $K(\cdot,\cdot)$ is a c.p.d. kernel, provided the expectations exist.

Theorem 2.7.1: Let the joint density of T, U have the MLR property (2.4.4). Let $H_0$, H be as in Lemma 2.7.1. If $N = N(\varphi^*)$ is the number of correct matches due to the M.L.P $\varphi^*$, then

\[
E_H(N) \ge 1. \tag{2.7.4}
\]

Proof: It follows from the general representation of $N(\varphi)$ in equation (2.6.11) that

\[
E_H(N) = n\,P_H(R_{11} = R_{21}) = n\,E_H(K(R_{11}, R_{21})), \tag{2.7.5}
\]

where $K(x,y) = I(x = y)$. Now, recall from Example 2.7.1 that K(x,y) is c.p.d. on the domain J×J, where $J = \{1, 2, \ldots, n\}$ is the common support of $R_{11}$ and $R_{21}$. It was established in Theorems 2.6.2 and 2.6.3 that $R_{11}$ and $R_{21}$ are PDM (PDE) according as T and U are PDM (PDE). Invoking Lemma 2.7.1, we therefore obtain

\[
E_H(K(R_{11}, R_{21})) \ge E_{H_0}(K(R_{11}, R_{21})). \tag{2.7.6}
\]

Under $H_0$, $R_{11}$ and $R_{21}$ are independent. Also, these ranks are marginally discrete uniform random variables on 1, 2, ..., n. Hence, we get

\[
E_{H_0}(K(R_{11}, R_{21})) = P_{H_0}(R_{11} = R_{21})
 = \sum_{k=1}^{n} P(R_{11} = k)\,P(R_{21} = k)
 = \sum_{k=1}^{n} \frac{1}{n^2}
 = \frac{1}{n}. \tag{2.7.7}
\]

Equations (2.7.5) to (2.7.7) imply the desired inequality:

\[
E_H(N) \ge n \cdot \frac{1}{n} = 1. \qquad \Box
\]
We conclude from (2.7.4) that $\varphi^*$ provides, on the average, more correct matches when the data in the two files come from certain positively dependent populations than when they are independent. In particular, this fact holds for the bivariate normal distribution with positive correlation as well as for Morgenstern distributions in Equation (2.6.16), where the dependence parameter $\alpha \ge 0$. In the light of Theorem 2.7.1, it is natural to conjecture that $E_H(N)$, as a functional of the distribution function H, is order preserving with regard to certain partial orderings of the space of all continuous bivariate distributions which have fixed marginals (those of T and U) and exhibit positive dependence. Although no proof of this conjecture is available at this time, we offer further evidence in support of this conjecture in the next two theorems.

Theorem 2.7.2: Suppose that a broken random sample comes from the family of densities given by the equation

\[
h(t,u) = 1 + \alpha\,(1-2t)(1-2u), \qquad 0 \le t, u \le 1 \ \text{and}\ 0 \le \alpha \le 1. \tag{2.7.8}
\]

Then, $E_\alpha(N)$ is monotone increasing in $\alpha$.

Proof: Note that in (2.7.8), $\alpha = 0$ means T and U are independent and we might say that the farther $\alpha$ is from 0 the more the positive dependence between T and U. For this family, the marginal distributions of T and U are uniform on [0,1].

It follows from equation (2.6.27) and Corollary 2.6.1 that the joint probability function of the ranks $R_{11}$ and $R_{21}$ can be canonically expanded as follows:

\[
\pi_{ij} = P(R_{11} = i,\ R_{21} = j)
 = \frac{1}{n^2}\Big[1 + \sum_{k=1}^{n} \binom{n}{k} \alpha^k\,\eta_k(i)\,\eta_k(j)\Big], \tag{2.7.9}
\]

where $i, j = 1, 2, \ldots, n$ and $\{\eta_k(\cdot)\}$ is a set of functions satisfying the orthogonality conditions in (2.6.13). Using the expression (2.7.9) for $\pi_{ii}$ we get

\[
E_\alpha(N) = n\,P(R_{11} = R_{21})
 = n \sum_{i=1}^{n} \pi_{ii}
 = n \cdot \frac{1}{n^2}\Big[n + \sum_{i=1}^{n} \sum_{k=1}^{n} \binom{n}{k} \alpha^k (\eta_k(i))^2\Big]
 = 1 + \frac{1}{n} \sum_{k=1}^{n} \binom{n}{k}\,b_k\,\alpha^k, \tag{2.7.10}
\]

where, after change of the order of summations on i and k, we have used nonnegative constants $b_k$ given by the equation

\[
b_k = \sum_{i=1}^{n} (\eta_k(i))^2, \qquad k = 1, 2, \ldots, n.
\]

It follows from (2.7.10) that $E_\alpha(N)$ is a polynomial in $\alpha$ with nonnegative coefficients and hence it increases with $\alpha$, as $\alpha$ goes from 0 to 1. □

Theorem 2.7.3: Suppose that a broken random sample comes from the bivariate normal distributions given by (2.6.15), where we assume that the correlation parameter $\rho$ is nonnegative. Then $E_\rho(N)$ is increasing in $\rho$.

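Proof: It follows from equation (2.6.27) and Corollary 2.6.1 that

\[
\pi_{ij} = P(R_{11} = i,\ R_{21} = j)
 = \frac{1}{n^2}\Big[1
 + n \sum_{k=1}^{\infty} \rho^{k}\,\psi_k^{(1)}(i)\,\psi_k^{(1)}(j)
 + \binom{n}{2} \sum_{k_1=1}^{\infty} \sum_{k_2=1}^{\infty} \rho^{k_1+k_2}\,\psi_{k_1,k_2}^{(2)}(i)\,\psi_{k_1,k_2}^{(2)}(j)
 + \cdots
 + \sum_{k_1=1}^{\infty} \cdots \sum_{k_n=1}^{\infty} \rho^{k_1+\cdots+k_n}\,\psi_{k_1,\ldots,k_n}^{(n)}(i)\,\psi_{k_1,\ldots,k_n}^{(n)}(j)\Big], \tag{2.7.11}
\]

where, for fixed $\ell = 1, 2, \ldots, n$, $\{\psi_{k_1,\ldots,k_\ell}^{(\ell)}\}$ is a set of orthogonal functions on $\{1, 2, \ldots, n\}$. Using the expression (2.7.11) for $\pi_{ii}$, we obtain

\[
E_\rho(N) = n\,P(R_{11} = R_{21}) = n \sum_{i=1}^{n} \pi_{ii}
 = \frac{1}{n}\Big[n
 + n \sum_{k=1}^{\infty} \rho^{k} \sum_{i=1}^{n} (\psi_k^{(1)}(i))^2
 + \binom{n}{2} \sum_{k_1=1}^{\infty} \sum_{k_2=1}^{\infty} \rho^{k_1+k_2} \sum_{i=1}^{n} (\psi_{k_1,k_2}^{(2)}(i))^2
 + \cdots
 + \sum_{k_1=1}^{\infty} \cdots \sum_{k_n=1}^{\infty} \rho^{k_1+\cdots+k_n} \sum_{i=1}^{n} (\psi_{k_1,\ldots,k_n}^{(n)}(i))^2\Big], \tag{2.7.12}
\]

where the order of summations over i and $k_1, \ldots, k_n$ has been reversed because the terms in the expansion (2.7.12) are all nonnegative. We conclude from (2.7.12) that $E_\rho(N)$ is a power series in $\rho$ with nonnegative coefficients and hence it increases with $\rho$ as $\rho$ goes from 0 to 1. □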
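The monotone behaviour asserted in Theorem 2.7.3 can be explored with a small simulation. This is a minimal sketch under stated assumptions, not the Monte Carlo design used in the text: the function names are hypothetical, pairs are drawn from a standard bivariate normal with correlation $\rho$, and $\varphi^*$ is taken to pair the i-th smallest x with the i-th smallest y, so pair i is correctly matched exactly when the rank of $T_i$ equals the rank of $U_i$.

```python
import math
import random

def ranks(values):
    """Ranks 1..n of the entries of `values` (ties have probability zero here)."""
    order = sorted(range(len(values)), key=values.__getitem__)
    result = [0] * len(values)
    for rank, idx in enumerate(order, start=1):
        result[idx] = rank
    return result

def mean_correct_matches(rho, n, reps, seed=0):
    """Monte Carlo estimate of E(N) for the rank-matching strategy."""
    rng = random.Random(seed)
    total = 0
    for _ in range(reps):
        t, u = [], []
        for _ in range(n):
            z1, z2 = rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)
            t.append(z1)
            u.append(rho * z1 + math.sqrt(1.0 - rho * rho) * z2)
        # A pair is correctly matched when both coordinates have the same rank.
        total += sum(1 for a, b in zip(ranks(t), ranks(u)) if a == b)
    return total / reps

for rho in (0.0, 0.5, 0.9):
    print(rho, mean_correct_matches(rho, n=10, reps=4000))
```

Under independence the estimate should be near 1, in line with (2.7.7), and it should grow as $\rho$ increases.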

As we close this section, we shall state a result due to Chew (1973) which somewhat resembles, though is conceptually different from, the inequality $E_H(N) \ge 1$ in (2.7.4). Recall the notation $M(\varphi)$ in (2.4.9), which denotes the posterior expected number of correct matches due to the strategy $\varphi$. Arguing that $M(\varphi) = 1$ when $\varphi$ is randomly chosen from $\Phi$, he proved the following result:

Theorem 2.7.4 (Chew, 1973): Let $x_1, \ldots, x_n$ and $y_1, \ldots, y_n$ be a broken random sample from a bivariate distribution possessing monotone likelihood ratio. If $x_1 < \cdots < x_n$ and $y_1 < \cdots < y_n$, then the posterior expected number of correct pairings using the M.L.P $\varphi^*$ is at least unity, that is

\[
M(\varphi^*) \ge 1. \tag{2.7.13}
\]

It should be noted that the inequality (2.7.13) was derived from a Bayesian perspective, whereas in our inequality (2.7.4) the expectation is over all possible samples. Finally, note that while our comparison is between dependent and independent populations for the M.L.P, Chew's inequality compares the M.L.P with random pairing.


2.8 Some Properties of $N(\varphi^*, \varepsilon)$

The maximum likelihood pairing, $\varphi^*$, was introduced in subsection 2.4.2 and some of its small sample properties were studied in Section 2.7. Specifically, the behavior of $E(N(\varphi^*))$ was discussed while holding the sample size n constant and changing only the degree of dependence in the population. We shall now fix the parameters describing dependence in the population of $\binom{T}{U}$ and allow n to tend to infinity in order to study the behavior of $N(\varphi^*, \varepsilon)$. Later, in this section, we shall present the results of a Monte Carlo study about $N(\varphi^*, \varepsilon)$ in which we vary the dependence parameters even as n takes different values.

In this section, the notations of Section 2.2 will be used freely. Recall that $N(\varphi^*)$ and $N(\varphi^*, \varepsilon)$ have the shorter notations N and $N(\varepsilon)$ respectively. We start with a review of Yahav (1982)'s results concerning $E(N(\varepsilon))$.


Assuming that the distribution of T and U is such that the conditional distribution of U given that T = t is (univariate) normal with mean t and variance 1, Yahav (1982) derived the limiting value of $\mu_n(\varepsilon) = E(N(\varepsilon)/n)$, as $n \to \infty$, by using the representation (2.6.2) in which the summands are functions of the order-statistics of $U_1, \ldots, U_n$ and the concomitants of the order statistics of $T_1, \ldots, T_n$. His proof relied on an approximation theorem (Bickel and Yahav, 1977) about the order-statistics for the above model. Furthermore, he reported the findings of a Monte Carlo study for a particular case of his model, namely, T and U are bivariate normal with correlation $\rho$.

First, we discuss the large sample behavior of $N(\varepsilon)/n$ in case of samples from an arbitrary population. The properties of its expected value are available as a consequence. Second, we indicate how Yahav's simulation study of the small sample properties of $\mu_n(\varepsilon)$ can be improved upon. We shall then present the results of our own Monte Carlo study of $\mu_n(\varepsilon)$ when n is small.

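Theorem 2.8.1: For broken random samples from an absolutely continuous distribution,

\[
\frac{N(\varepsilon)}{n} \stackrel{P}{\longrightarrow} \mu(\varepsilon), \quad \text{as } n \to \infty, \tag{2.8.2}
\]

where $\mu(\varepsilon) = P\big(F(T - \varepsilon) \le G(U) \le F(T + \varepsilon)\big)$.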

Proof: Let $L_n = N(\varepsilon)/n$. Recall the representation (2.6.6) for $N(\varepsilon)$ as a sum of exchangeable indicators:

\[
N(\varepsilon) = \sum_{i=1}^{n} I(A_{ni}(\varepsilon)). \tag{2.8.3}
\]

It follows that

\[
E(L_n) = n\,P(A_{n1}(\varepsilon))/n = P(A_{n1}(\varepsilon)). \tag{2.8.4}
\]

Note that

\[
E(L_n^2) = n^{-2}\big[E(N(\varepsilon))^{(2)} + E(N(\varepsilon))\big], \tag{2.8.5}
\]

where $E(N(\varepsilon))^{(2)}$ is the second factorial moment of $N(\varepsilon)$. Using the exchangeable representation (2.8.3) again, we get

\[
E(L_n^2) = n^{-2}\big[n^{(2)}\,P(A_{n1}(\varepsilon) A_{n2}(\varepsilon)) + n\,P(A_{n1}(\varepsilon))\big]. \tag{2.8.6}
\]

Let

\[
\eta_{1\alpha} = \sum_{i=1}^{n} \xi_{1\alpha i}, \qquad \eta_{2\alpha} = \sum_{i=1}^{n} \xi_{2\alpha i}, \qquad \alpha = 1, 2, \ldots, n,
\]

where the sequences $\{\xi_{1\alpha i}\}$ and $\{\xi_{2\alpha i}\}$ are defined in (2.2.12). Using (2.8.6), we get

\[
A_{n1}(\varepsilon) = \big(\eta_{11}/n \le 0,\ \eta_{21}/n \le 0\big) \tag{2.8.7}
\]

and

\[
A_{n1}(\varepsilon) A_{n2}(\varepsilon) = \bigcap_{i=1}^{2} \bigcap_{j=1}^{2} \big(\eta_{ij}/n \le 0\big). \tag{2.8.8}
\]

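Note that, given $(T_1, U_1) = (t_1, u_1)$, the infinite sequence $\xi_{112}, \xi_{113}, \ldots$ ad inf. is exchangeable. Hence, by the Strong Law of Large Numbers (SLLN) for exchangeable random variables (see Chow and Teicher, p. 223),

\[
\eta_{11}/n \stackrel{a.s.}{\longrightarrow} E(\xi_{112} \mid T_1 = t_1, U_1 = u_1), \quad \text{as } n \to \infty, \tag{2.8.9}
\]

where the conditional expectation is equal to $F(t_1 - \varepsilon) - G(u_1)$. It follows from (2.8.9) that

\[
\eta_{11}/n \stackrel{a.s.}{\longrightarrow} F(T_1 - \varepsilon) - G(U_1). \tag{2.8.10}
\]

We can show by similar arguments that

\[
\eta_{1\alpha}/n \stackrel{a.s.}{\longrightarrow} F(T_\alpha - \varepsilon) - G(U_\alpha), \tag{2.8.11}
\]

\[
\eta_{2\alpha}/n \stackrel{a.s.}{\longrightarrow} G(U_\alpha) - F(T_\alpha + \varepsilon), \tag{2.8.12}
\]

where $\alpha = 1, 2$.

Using the fact (see Serfling, 1980, p. 52) that a sequence of vectors converges almost surely to a given vector iff the componentwise sequences converge almost surely to the appropriate components of the limit, we get from (2.8.11) and (2.8.12)

\[
\frac{1}{n}
\begin{pmatrix} \eta_{11} \\ \eta_{21} \\ \eta_{12} \\ \eta_{22} \end{pmatrix}
\stackrel{a.s.}{\longrightarrow}
\begin{pmatrix}
F(T_1 - \varepsilon) - G(U_1) \\
G(U_1) - F(T_1 + \varepsilon) \\
F(T_2 - \varepsilon) - G(U_2) \\
G(U_2) - F(T_2 + \varepsilon)
\end{pmatrix}. \tag{2.8.13}
\]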

It follows from (2.8.7), (2.8.8), (2.8.13) and the independence of $\binom{T_1}{U_1}$ and $\binom{T_2}{U_2}$ that

\[
P(A_{n1}(\varepsilon)) \to \mu(\varepsilon) \tag{2.8.14}
\]

and

\[
P(A_{n1}(\varepsilon) A_{n2}(\varepsilon)) \to \mu^2(\varepsilon). \tag{2.8.15}
\]

Using (2.8.4), (2.8.6), (2.8.14), (2.8.15) it is easy to verify that, as $n \to \infty$,

\[
E(L_n) \to \mu(\varepsilon) \quad \text{and} \quad \mathrm{var}(L_n) \to 0. \tag{2.8.16}
\]

It is well known that (2.8.16) implies the convergence in probability in (2.8.2). □
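The last two steps of this proof can be written out as a short worked calculation. The following is a sketch using only quantities already defined in the proof, with $n^{(2)} = n(n-1)$ and an arbitrary fixed $\delta > 0$ (the symbol $\delta$ is introduced here for illustration):

```latex
\begin{align*}
\mathrm{var}(L_n)
  &= E(L_n^2) - \big(E(L_n)\big)^2 \\
  &= \frac{n(n-1)}{n^2}\,P\big(A_{n1}(\varepsilon)A_{n2}(\varepsilon)\big)
     + \frac{1}{n}\,P\big(A_{n1}(\varepsilon)\big)
     - \big[P\big(A_{n1}(\varepsilon)\big)\big]^2 \\
  &\longrightarrow \mu^2(\varepsilon) + 0 - \mu^2(\varepsilon) = 0 ,
\end{align*}
and, by Chebyshev's inequality applied around $\mu(\varepsilon)$,
\begin{equation*}
P\big(|L_n - \mu(\varepsilon)| > \delta\big)
  \le \frac{\mathrm{var}(L_n) + \big(E(L_n) - \mu(\varepsilon)\big)^2}{\delta^2}
  \longrightarrow 0 .
\end{equation*}
```

The first display uses (2.8.4) and (2.8.6) together with the limits (2.8.14) and (2.8.15); the second is the standard route from mean-square convergence to convergence in probability.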

The following corollary generalizes Yahav (1982)'s result concerning $\mu_n(\varepsilon)$, the first moment of $N(\varepsilon)/n$.

Corollary 2.8.1: For $p > 0$,

\[
\text{(i)} \quad \frac{N(\varepsilon)}{n} \stackrel{L_p}{\longrightarrow} \mu(\varepsilon), \quad \text{as } n \to \infty. \tag{2.8.17}
\]

\[
\text{(ii)} \quad E\big(N(\varepsilon)/n\big)^p \to \big(\mu(\varepsilon)\big)^p, \quad \text{as } n \to \infty. \tag{2.8.18}
\]

Proof: The number of $\varepsilon$-correct matches can at most be n, the number of pairs in the unobserved bivariate data. Hence,

\[
0 \le \frac{N(\varepsilon)}{n} \le 1 \qquad \forall\ n = 1, 2, \ldots
\]

In other words, $\{N(\varepsilon)/n\}$ is a uniformly bounded sequence of random variables. It is well known that convergence in probability and $L_p$ convergence are equivalent for such sequences. Hence, (i) is an easy consequence of Theorem 2.8.1. It follows from (i) and Theorem 4.5.4 of Chung (1974) that the p-th moment of $N(\varepsilon)/n$ converges to $[\mu(\varepsilon)]^p$. Hence (ii) also holds. □

Note that no assumption about the conditional distribution of U given T was made either in Theorem 2.8.1 or Corollary 2.8.1.

Yahav generated samples from a bivariate normal parent with mean vector $\binom{0}{0}$ and covariance matrix

\[
\begin{pmatrix}
\rho^2/(1-\rho^2) & \rho^2/(1-\rho^2) \\
\rho^2/(1-\rho^2) & 1/(1-\rho^2)
\end{pmatrix}. \tag{2.8.19}
\]

Note that in (2.8.19) the variances of T and U are functions of the correlation of T and U because Yahav requires that the conditional distribution of U given T = t be normal with mean t and variance 1.
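The dependence of the variances on the correlation can be made explicit with a short derivation. This is a sketch under the stated model only: write $U = T + e$, where $e$ is standard normal and independent of $T$, and let $\sigma^2$ denote the variance of $T$ (the symbols $e$ and $\sigma^2$ are introduced here for illustration).

```latex
\begin{align*}
\mathrm{cov}(T, U) &= \mathrm{var}(T) = \sigma^2, &
\mathrm{var}(U) &= \sigma^2 + 1, \\
\rho^2 &= \frac{\mathrm{cov}(T,U)^2}{\mathrm{var}(T)\,\mathrm{var}(U)}
        = \frac{\sigma^2}{\sigma^2 + 1}
 &\Longrightarrow\quad
\sigma^2 &= \frac{\rho^2}{1 - \rho^2}, \qquad
\mathrm{var}(U) = \frac{1}{1 - \rho^2}.
\end{align*}
```

So both marginal variances grow without bound as $\rho$ approaches 1, which is why changing $\rho$ in this model changes the marginal distributions as well as the dependence.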


The limiting value of $\mu_n(\varepsilon)$ for his particular model was given by the integral


yU) 


I  {*(x 


c  .  , ,  VI -p 

r)  -  *  x  — tL 
p 


|))  d*(x) 


(2.8.20) 


He computed $\mu(\varepsilon)$ by numerical integration for $\varepsilon$ = 0.01, 0.05, 0.1, 0.3. He also provided Monte Carlo estimates of $\mu_n(\varepsilon)$, for n = 10, 20 and 50, using the simulated data on T and U. The following table is a typical example from his tables.


Table 2.1  Expected Average Number of ε-Correct Matchings, ε = .01
(Yahav (1982))

  ρ      μ_10(ε)   μ_20(ε)   μ_50(ε)   μ(ε)
  .01    .5864     .5326     .52752    .52269
  .05    .1984     .1648     .12712    .11522
  .10    .1512     .1058     .07600    .05912
  .30    .1084     .0686     .03888    .02144
  .50    .1020     .0582     .02720    .0138?
  .70    .0960     .0614     .02616    .01051
  .90    .0972     .0540     .02064    .00864
  .95    .0976     .0496     .02144    .00829
  .99    .0960     .0484     .02128    .00804

It is clear from Table 2.1 that $\mu_n(\varepsilon)$ and $\mu(\varepsilon)$ are decreasing as $\rho$ ranges from 0.01 to 0.99. However, one expects that an optimal strategy such as $\varphi^*$ has the property that $\mu_n(\varepsilon)$ as well as $\mu(\varepsilon)$ are monotone increasing in $\rho$. The problem here is not with the M.L.P, $\varphi^*$, but with Yahav's model in (2.8.19) because, as the correlation changes its value, so do the marginal variances of T and U. To rectify this problem, we assumed a bivariate normal model for T and U in which the means were zero and the covariance matrix was

\[
\begin{pmatrix} 1 & \rho \\ \rho & 1 \end{pmatrix}. \tag{2.8.21}
\]

For each combination of four values of n, namely 10, 20, 50 and 100, and twelve values of $\rho$, namely 0.00, 0.10 (0.10) 0.90, 0.95, 0.99, a sample of size 1000 was generated from the bivariate normal population using the IMSL subroutines. These data were used to obtain Monte Carlo estimates of $\mu_n(\varepsilon)$, where $\varepsilon$ was given the values 0.01, 0.05, 0.1, 0.3, 0.5, 0.75, 1.0. Furthermore, it is easy to show that, for the model in (2.8.21),


\[
\mu(\varepsilon) = P\big(|Z| \le \varepsilon / \sqrt{2(1-\rho)}\big), \tag{2.8.22}
\]

where Z is a standard normal random variable. It is clear from (2.8.22) that $\mu(\varepsilon)$ is a monotone increasing function of $\rho$. Using standard normal CDF tables, $\mu(\varepsilon)$ in (2.8.22) was computed for each combination of the twelve values of $\rho$ and the seven values of $\varepsilon$ mentioned above. We have presented the estimated values of $\mu_n(\varepsilon)$ and the limiting value $\mu(\varepsilon)$ in Table 2.2 to Table 2.8.
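The quantities in this study can be reproduced in outline with a few lines of code. The sketch below is not the original IMSL-based program: the function names are hypothetical, and the working definition of an ε-correct match (the x assigned by rank matching lies within ε of the true partner of y) is an assumption made for illustration. `limiting_mu` evaluates (2.8.22) exactly, so its values can differ slightly from figures read off a printed normal table.

```python
import math
import random

def limiting_mu(eps, rho):
    """mu(eps) = P(|Z| <= eps / sqrt(2 (1 - rho))) for a standard normal Z."""
    x = eps / math.sqrt(2.0 * (1.0 - rho))
    return math.erf(x / math.sqrt(2.0))  # P(|Z| <= x) = erf(x / sqrt(2))

def mc_mu_n(eps, rho, n, reps, seed=0):
    """Monte Carlo estimate of E(N(eps)/n) under unit-variance bivariate normality."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(reps):
        pairs = []
        for _ in range(n):
            z1, z2 = rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)
            pairs.append((z1, rho * z1 + math.sqrt(1.0 - rho * rho) * z2))
        t_sorted = sorted(t for t, _ in pairs)
        by_u = sorted(pairs, key=lambda p: p[1])
        # Rank matching assigns the i-th smallest t to the i-th smallest u;
        # the match is eps-correct if that t is within eps of u's true partner.
        hits += sum(1 for i, (t_true, _) in enumerate(by_u)
                    if abs(t_sorted[i] - t_true) <= eps)
    return hits / (reps * n)

print(round(limiting_mu(0.01, 0.99), 3))        # about 0.056
print(mc_mu_n(0.3, 0.9, n=10, reps=2000))
```

For fixed ε the limiting value increases with ρ, as stated after (2.8.22), and the simulated small-sample values can be compared against it for any chosen n.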

Table 2.2  Expected Average Number of ε-Correct Matchings, ε = 0.01

  ρ       μ_10(ε)   μ_20(ε)   μ_50(ε)   μ_100(ε)   μ(ε)
  0.00    0.106     0.054     0.025     0.015      0.008
  0.10    0.113     0.059     0.028     0.017      0.008
  0.20    0.127     0.068     0.031     0.018      0.008
  0.30    0.138     0.075     0.034     0.020      0.008
  0.40    0.155     0.083     0.038     0.023      0.008
  0.50    0.174     0.095     0.044     0.026      0.008
  0.60    0.199     0.109     0.051     0.030      0.008
  0.70    0.231     0.129     0.061     0.036      0.008
  0.80    0.279     0.162     0.077     0.046      0.016
  0.90    0.374     0.222     0.109     0.067      0.016
  0.95    0.476     0.296     0.151     0.094      0.024
  0.99    0.700     0.521     0.299     0.191      0.056

Table 2.3  Expected Average Number of
$\epsilon$-Correct Matchings, $\epsilon = 0.05$

$\rho$   $\mu_{10}(\epsilon)$   $\mu_{20}(\epsilon)$   $\mu_{50}(\epsilon)$   $\mu_{100}(\epsilon)$   $\mu(\epsilon)$
0.00     0.127                  0.076                  0.047                  0.037                   0.032
0.10     0.134                  0.082                  0.051                  0.040                   0.032
0.20     0.149                  0.093                  0.056                  0.043                   0.032
0.30     0.161                  0.099                  0.061                  0.047                   0.032
0.40     0.180                  0.109                  0.066                  0.052                   0.040
0.50     0.201                  0.124                  0.074                  0.057                   0.040
0.60     0.228                  0.141                  0.085                  0.065                   0.048
0.70     0.262                  0.166                  0.101                  0.076                   0.048
0.80     0.317                  0.205                  0.124                  0.094                   0.064
0.90     0.420                  0.280                  0.174                  0.135                   0.088
0.95     0.529                  0.368                  0.237                  0.186                   0.127
0.99     0.769                  0.631                  0.459                  0.377                   0.274

Table 2.4  Expected Average Number of
$\epsilon$-Correct Matchings, $\epsilon = 0.1$

$\rho$   $\mu_{10}(\epsilon)$   $\mu_{20}(\epsilon)$   $\mu_{50}(\epsilon)$   $\mu_{100}(\epsilon)$   $\mu(\epsilon)$
0.00     0.164                  0.102                  0.076                  0.066                   0.056
0.10     0.160                  0.110                  0.080                  0.069                   0.056
0.20     0.177                  0.121                  0.087                  0.074                   0.064
0.30     0.189                  0.130                  0.093                  0.080                   0.064
0.40     0.210                  0.143                  0.101                  0.088                   0.072
0.50     0.234                  0.161                  0.112                  0.096                   0.080
0.60     0.264                  0.181                  0.127                  0.108                   0.088
0.70     0.302                  0.210                  0.149                  0.126                   0.103
0.80     0.363                  0.268                  0.182                  0.164                   0.127
0.90     0.477                  0.347                  0.264                  0.218                   0.174
0.95     0.694                  0.462                  0.342                  0.299                   0.251
0.99     0.839                  0.744                  0.630                  0.680                   0.522

Table 2.5  Expected Average Number of
$\epsilon$-Correct Matchings, $\epsilon = 0.3$

$\rho$   $\mu_{10}(\epsilon)$   $\mu_{20}(\epsilon)$   $\mu_{50}(\epsilon)$   $\mu_{100}(\epsilon)$   $\mu(\epsilon)$
0.00     0.266                  0.208                  0.184                  0.176                   0.166
0.10     0.266                  0.223                  0.196                  0.186                   0.174
0.20     0.284                  0.237                  0.207                  0.197                   0.190
0.30     0.306                  0.263                  0.221                  0.211                   0.197
0.40     0.334                  0.276                  0.240                  0.229                   0.213
0.50     0.363                  0.304                  0.263                  0.260                   0.236
0.60     0.401                  0.336                  0.293                  0.278                   0.266
0.70     0.466                  0.382                  0.337                  0.320                   0.303
0.80     0.632                  0.467                  0.403                  0.336                   0.362
0.90     0.670                  0.693                  0.640                  0.619                   0.497
0.95     0.802                  0.733                  0.689                  0.674                   0.658
0.99     0.978                  0.968                  0.961                  0.961                   0.966

Table 2.6  Expected Average Number of
$\epsilon$-Correct Matchings, $\epsilon = 0.5$

$\rho$   $\mu_{10}(\epsilon)$   $\mu_{20}(\epsilon)$   $\mu_{50}(\epsilon)$   $\mu_{100}(\epsilon)$   $\mu(\epsilon)$
0.00     0.363                  0.311                  0.290                  0.281                   0.274
0.10     0.367                  0.330                  0.306                  0.298                   0.289
0.20     0.390                  0.348                  0.326                  0.316                   0.311
0.30     0.417                  0.371                  0.344                  0.336                   0.326
0.40     0.462                  0.400                  0.373                  0.362                   0.354
0.50     0.486                  0.437                  0.404                  0.393                   0.383
0.60     0.628                  0.478                  0.446                  0.435                   0.425
0.70     0.691                  0.636                  0.606                  0.495                   0.484
0.80     0.676                  0.628                  0.694                  0.584                   0.570
0.90     0.811                  0.773                  0.752                  0.744                   0.737
0.95     0.917                  0.896                  0.888                  0.885                   0.886
0.99     0.998                  0.999                  0.999                  0.999                   1.000

Table 2.7  Expected Average Number of
$\epsilon$-Correct Matchings, $\epsilon = 0.75$

$\rho$   $\mu_{10}(\epsilon)$   $\mu_{20}(\epsilon)$   $\mu_{50}(\epsilon)$   $\mu_{100}(\epsilon)$   $\mu(\epsilon)$
0.00     0.468                  0.433                  0.416                  0.409                   0.404
0.10     0.488                  0.454                  0.437                  0.429                   0.425
0.20     0.514                  0.477                  0.461                  0.453                   0.445
0.30     0.539                  0.505                  0.487                  0.480                   0.471
0.40     0.582                  0.542                  0.522                  0.514                   0.503
0.50     0.621                  0.586                  0.560                  0.555                   0.547
0.60     0.662                  0.633                  0.613                  0.606                   0.599
0.70     0.727                  0.694                  0.679                  0.673                   0.668
0.80     0.810                  0.786                  0.772                  0.768                   0.766
0.90     0.919                  0.908                  0.906                  0.904                   0.907
0.95     0.979                  0.976                  0.978                  0.979                   0.982
0.99     1.000                  1.000                  1.000                  1.000                   1.000

Table 2.8  Expected Average Number of
$\epsilon$-Correct Matchings, $\epsilon = 1.0$

$\rho$   $\mu_{10}(\epsilon)$   $\mu_{20}(\epsilon)$   $\mu_{50}(\epsilon)$   $\mu_{100}(\epsilon)$   $\mu(\epsilon)$
0.00     0.870                  0.848                  0.831                  0.824                   0.522
0.10     0.803                  0.866                  0.888                  0.849                   0.547
0.20     0.621                  0.808                  0.881                  0.876                   0.570
0.30     0.646                  0.622                  0.611                  0.608                   0.605
0.40     0.600                  0.664                  0.680                  0.644                   0.637
0.50     0.720                  0.707                  0.601                  0.688                   0.683
0.60     0.772                  0.783                  0.744                  0.741                   0.737
0.70     0.830                  0.812                  0.807                  0.808                   0.803
0.80     0.808                  0.880                  0.887                  0.888                   0.886
0.90     0.070                  0.070                  0.072                  0.072                   0.975
0.95     0.006                  0.006                  0.007                  0.097                   0.998
0.99     1.000                  1.000                  1.000                  1.000                   1.000

Note that, as expected, $\mu_n(\epsilon)$ is a monotone increasing function
of $\rho$ for each fixed $\epsilon$. Furthermore, the quality of the merged
file is quite good if we want to recreate contingency tables with
intervals of size 0.50 or more and the correlation $\rho$ is $\geq 0.8$.

2.9  Poisson Convergence of $N(\varphi^*)$

Let us revisit, for a moment, the card-matching problem which
was discussed in Section 2.3. Some of the distributional properties
of the number of correct matches in randomly arranging one pack of
cards against another were stated in Proposition 2.3.1. In particular,
the well known approximation of the distribution of the number
of correct matches by a Poisson distribution with mean 1 was
mentioned. This Poisson approximation may be motivated by the
observation that the occurrence of a match tends to be a rare event
when the number of cards in the matching problem grows indefinitely.
Inspired by this result, it is natural to ask whether Poisson
distributions can approximate the distribution of the number of correct
matches due to data-based matching strategies. The answer is in the
affirmative in the case of the maximum likelihood pairing $\varphi^*$. Our
aim in this section is to establish the Poisson convergence of $N(\varphi^*)$.

Using the general representation in Corollary 2.6.1 for the
number of correct matches, we can write

    $N \equiv N(\varphi^*) = \sum_{i=1}^{n} I(A_{ni})$,    (2.9.1)

where $A_{ni} = (R_{1i} = R_{2i})$, $i = 1,2,\ldots,n$, are exchangeable events. It
follows that $E(N) = nP(A_{n1})$. Zolutikhina and Latishev (1978)
sketched a proof of the fact that the expectation of $N$ converges to a
constant as $n$ tends to $\infty$. Their approach starts with writing $P(A_{n1})$
as the triple integral


    $P(A_{n1}) = \frac{1}{\pi} \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} \int_{\theta=0}^{\pi} \exp[(n-1)\ln(s(x,y,\theta))]\, d\theta\, dH(x,y)$

where $s(x,y,\theta) = p_3(x,y) + 2\sqrt{p_1(x,y)\,p_2(x,y)}\,\cos 2\theta$,

    $p_1(x,y) = F(x) - H(x,y)$,

    $p_2(x,y) = G(y) - H(x,y)$,

and $p_3(x,y) = 1 - p_1(x,y) - p_2(x,y)$, $\forall\, x,y \in R$, $0 \leq \theta \leq \pi$.

Using the well known method of Laplace (Bleistein and Handelsman
1975), they expanded this integral in powers of $1/n$ and concluded that
$P(A_{n1}) \approx a/n$ for large $n$, where the constant $a$ is given by

    $a = \int_{-\infty}^{\infty} [h(x, G^{-1}F(x)) / g(G^{-1}F(x))]\, dx$    (2.9.2)

They concluded that, in large samples, $E(N) \approx a$.

In this section, we shall generalize the result of Zolutikhina
and Latishev (1978) by showing that the $d$th factorial moment of
$N$, $E(N^{(d)})$, converges to $a^d$, $d \geq 1$, under certain conditions on the
distribution of $(T, U)'$. As a consequence, we shall obtain the weak
convergence of $N$ to the Poisson distribution with mean $a$.

We begin with the observation that the ranks
$\mathbf{R}_1 = (R_{11}, \ldots, R_{1n})$ and $\mathbf{R}_2 = (R_{21}, \ldots, R_{2n})$ are invariant
under increasing functions of $T$ and $U$ respectively. For this reason,
$N$ is also invariant under such transformations. Without loss of
generality, we therefore replace $T$ and $U$ by $F(T)$ and $G(U)$ respectively,
where $F$ ($G$) is the marginal distribution function of $T$ ($U$). This
so-called probability integral transformation allows us to assume that
$T$ and $U$ are marginally uniform random variables and that the parent
CDF, $H(t,u)$, is the joint CDF of $F(T)$ and $G(U)$. Furthermore, the
integral (2.9.2) simplifies to $a = \int_0^1 h(x,x)\,dx$. We might recall
from Section 2.2 that this simpler version of $a$ was called $\lambda$. We
shall henceforth use these simplifications and seek to prove that $N$
weakly converges to the Poisson distribution with mean $\lambda$.
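
To make the limiting mean concrete, the sketch below evaluates the diagonal integral of a copula density numerically. It assumes, purely for illustration, the standard Farlie-Gumbel-Morgenstern density $c(x,y) = 1 + \alpha(1-2x)(1-2y)$ with $|\alpha| \leq 1$ (the parametrization used elsewhere in this work may differ); for that family the integral is $1 + \alpha/3$ in closed form. The function names are hypothetical.

```python
def fgm_density(x: float, y: float, alpha: float) -> float:
    """Assumed Farlie-Gumbel-Morgenstern copula density on the unit square."""
    return 1.0 + alpha * (1.0 - 2.0 * x) * (1.0 - 2.0 * y)


def diagonal_integral(density, steps: int = 10_000) -> float:
    """Midpoint-rule approximation of the integral of density(x, x) over [0, 1]."""
    h = 1.0 / steps
    return h * sum(density((k + 0.5) * h, (k + 0.5) * h) for k in range(steps))


def fgm_lambda(alpha: float, steps: int = 10_000) -> float:
    """Limiting mean for the assumed FGM family, computed numerically."""
    return diagonal_integral(lambda x, y: fgm_density(x, y, alpha), steps)


if __name__ == "__main__":
    for alpha in (0.0, 0.25, 0.5, 1.0):
        print(alpha, round(fgm_lambda(alpha), 6), round(1.0 + alpha / 3.0, 6))
```

With $\alpha = 0$ (independence) the integral equals 1, the mean of the Poisson limit in the card-matching problem.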

Following Schweizer and Wolff (1981), the joint CDF of $F(T)$ and
$G(U)$ will be called a copula. In general, a copula is denoted by the
symbol $C(\cdot,\cdot)$ and the following Frechet bounds apply to any copula:

    $\max(x+y-1, 0) \leq C(x,y) \leq \min(x,y)$, $\forall\, (x,y) \in [0,1]^2$    (2.9.3)

However, for the purpose of deriving the distribution of $N$, we shall
consider only a part of the spectrum (2.9.3) of all possible copulas.
To motivate our choice of the copulas, first note that, in this
chapter, only absolutely continuous joint distributions are allowed for
$T$ and $U$. This means that the extremes $\max(x+y-1, 0)$ and $\min(x,y)$ are
ruled out because these copulas correspond to degenerate joint
distributions for $T$ and $U$ (Mardia 1970, p. 32). Second, Goel (1975)
has observed that $\varphi^* = (1,2,\ldots,n)$ is the M.L.P. iff the joint density
of $T$ and $U$ has the MLR property. However, the MLR property
necessarily implies that the distribution function of $(T, U)'$ must be such
that $C(x,y) \geq xy$, for all $(x,y)$ in the unit square (Tong (1980),
p. 80). We shall henceforth assume that the joint CDF of $T$ and $U$ will
satisfy the inequalities

    $xy \leq C(x,y) \leq \min(x,y)$, $\forall\, (x,y) \in [0,1]^2$.    (2.9.4)

Note that, in (2.9.4), $T$ and $U$ are independent iff $C(x,y) = xy$.
Positive dependence of $T$ and $U$ occurs when $C(x,y) > xy$, for all $x$ and
$y$. In the remainder of this section, the joint CDF of $T$ and $U$ will
be a copula $C$ in the class (2.9.4) and the corresponding joint density
function will be denoted by $c(x,y)$.
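
Membership in the class (2.9.4) can be checked numerically on a grid. The sketch below does so for an assumed Farlie-Gumbel-Morgenstern copula $C(x,y) = xy[1 + \alpha(1-x)(1-y)]$, which is used here only as an illustrative example; the helper names are hypothetical, and a grid check is a sanity test rather than a proof.

```python
def fgm_copula(x: float, y: float, alpha: float) -> float:
    """Assumed Farlie-Gumbel-Morgenstern copula, for illustration only."""
    return x * y * (1.0 + alpha * (1.0 - x) * (1.0 - y))


def in_class_2_9_4(copula, grid: int = 50, tol: float = 1e-12) -> bool:
    """Check x*y <= C(x, y) <= min(x, y) at every point of a (grid+1) x (grid+1) lattice."""
    for i in range(grid + 1):
        for j in range(grid + 1):
            x, y = i / grid, j / grid
            value = copula(x, y)
            if value < x * y - tol or value > min(x, y) + tol:
                return False
    return True


if __name__ == "__main__":
    for alpha in (-0.5, 0.0, 0.5, 1.0):
        ok = in_class_2_9_4(lambda x, y: fgm_copula(x, y, alpha))
        print(f"alpha = {alpha}: inside class (2.9.4) on the grid? {ok}")
```

In this family a negative $\alpha$ falls below the independence copula $xy$ and so lies outside the class, while $0 \leq \alpha \leq 1$ stays inside it.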

Since $\mathbf{R}_1$ and $\mathbf{R}_2$ are some permutations of $(1,2,\ldots,n)$, we find
it convenient to use the notation $\varphi$ for realizations of $\mathbf{R}_1$ or $\mathbf{R}_2$.
The common support of $\mathbf{R}_1$ and $\mathbf{R}_2$ is denoted by $\Phi$, the set of $n!$
permutations of $1,2,\ldots,n$.

We will now formally establish an equivalence between the card-matching
problem and the M.L.P. in the independence case.

Proposition 2.9.1: Let $T$ and $U$ be independent random variables.
Then the distribution of $\mathbf{V}_n = (V_{n1}, \ldots, V_{nn})$ defined in (2.2.6) is
the same as that of the vector $\boldsymbol{\Delta}_n = (\Delta_{n1}, \ldots, \Delta_{nn})$, where

    $\Delta_{ni} = I(R_{1i} = i)$, $i = 1,2,\ldots,n$.    (2.9.5)

Furthermore, the random variables $\Delta_{n1}, \ldots, \Delta_{nn}$ are exchangeable.

Proof: Note that the rank vectors

    $\mathbf{R}_1 = (R_{11}, \ldots, R_{1n})$ and $\mathbf{R}_2 = (R_{21}, \ldots, R_{2n})$

are independent because $T$ and $U$ are, by hypothesis, independent
random variables, and that $\mathbf{R}_1$ and $\mathbf{R}_2$ are discrete uniform on $\Phi$.
That is,

    $P(\mathbf{R}_a = \varphi) = \frac{1}{n!}$, $\forall\, \varphi \in \Phi$ and $a = 1,2$.    (2.9.6)

As the $V_{ni}$'s are indicators of the occurrence of matches, the
Bernoulli variables $\Delta_{n1}, \ldots, \Delta_{nn}$ in (2.9.5) can be looked upon as
indicating whether $R_{1i}$ matches with $i$ or not, $i = 1,2,\ldots,n$. It is
clear that the common support of $\mathbf{V}_n$ and $\boldsymbol{\Delta}_n$ is

    $\mathcal{A} = \{(a_1, \ldots, a_n) : a_i = 0 \text{ or } 1,\ i = 1,2,\ldots,n,\ \sum_{i=1}^{n} a_i \neq n-1\}$    (2.9.7)

Note that $\mathcal{A}$ has $2^n - n$ sample points, because exactly $n - 1$ matches
among $n$ positions is impossible.
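
The count of sample points in (2.9.7) can be confirmed by brute force for small $n$. The sketch below enumerates every permutation and collects the resulting match-indicator vectors; the function name `match_vectors` is hypothetical.

```python
from itertools import permutations


def match_vectors(n: int) -> set:
    """Set of indicator vectors (I(psi(i) = i))_{i=1..n} over all permutations psi of 1..n."""
    identity = tuple(range(1, n + 1))
    return {
        tuple(int(value == position) for value, position in zip(perm, identity))
        for perm in permutations(identity)
    }


if __name__ == "__main__":
    for n in range(1, 7):
        support = match_vectors(n)
        # Second and third columns should agree: |support| versus 2^n - n.
        print(n, len(support), 2 ** n - n)
```

The enumeration grows as $n!$, so it is only meant as a check for small $n$.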


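Let $\mathbf{a} = (a_1, \ldots, a_n)$ be a fixed but otherwise arbitrary point
in $\mathcal{A}$. Define the events

    $D(\mathbf{a}, \varphi) = \{\psi \in \Phi : I(\psi(i) = \varphi(i)) = a_i,\ \forall\, i = 1, \ldots, n\}$    (2.9.8)

where $\varphi \in \Phi$. Then, using the independence of $\mathbf{R}_1$ and $\mathbf{R}_2$ and
(2.9.8) we get

    $P(\mathbf{V}_n = \mathbf{a}) = P(I(R_{1i} = R_{2i}) = a_i,\ i = 1,2,\ldots,n)$

    $= E\, P(I(R_{1i} = \varphi(i)) = a_i,\ \forall\, i = 1,2,\ldots,n \mid \mathbf{R}_2 = \varphi)$

    $= E\, P(I(R_{1i} = \varphi(i)) = a_i,\ \forall\, i = 1,2,\ldots,n)$

    $= E\, P(\mathbf{R}_1 \in D(\mathbf{a}, \varphi))$    (2.9.9)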

We now observe that the components of $\mathbf{a}$ dictate which positions
of $\varphi = (\varphi(1), \ldots, \varphi(n))$ must be matched or mismatched by any
permutation $\psi$ in order that $\psi \in D(\mathbf{a}, \varphi)$. Clearly, the number of ways in
which we can permute the integers $1,2,\ldots,n$ and produce $\psi$'s that
belong to $D(\mathbf{a}, \varphi)$ depends only on the fixed vector $\mathbf{a}$ and the fact that
$\varphi$ is an arrangement of $n$ distinct integers. Hence the cardinality of
$D(\mathbf{a}, \varphi)$ does not change as $\varphi$ ranges over $\Phi$. In particular, $D(\mathbf{a}, \varphi)$
and $D(\mathbf{a}, \varphi^*)$ have the same number of sample points, where
$\varphi^* = (1,2,\ldots,n)$. Using (2.9.6), we therefore obtain

    $P(\mathbf{R}_1 \in D(\mathbf{a}, \varphi)) = P(\mathbf{R}_1 \in D(\mathbf{a}, \varphi^*))$, $\forall\, \varphi \in \Phi$    (2.9.10)

The right hand side expression in (2.9.10) is a fixed number
depending on $\varphi^*$ and the chosen $\mathbf{a}$. This means that in (2.9.9), we seek
the expectation of a degenerate random variable. Hence, we obtain

    $P(\mathbf{V}_n = \mathbf{a}) = P(\mathbf{R}_1 \in D(\mathbf{a}, \varphi^*)) = P(\boldsymbol{\Delta}_n = \mathbf{a})$    (2.9.11)

Because $\mathbf{a}$ was arbitrarily chosen from $\mathcal{A}$, we finally infer from
(2.9.11) that

    $(V_{n1}, \ldots, V_{nn}) \stackrel{d}{=} (\Delta_{n1}, \ldots, \Delta_{nn})$    (2.9.12)

The exchangeability of $\Delta_{n1}, \ldots, \Delta_{nn}$ follows from the fact that the
distribution of $\mathbf{R}_1$ is uniform over $\Phi$.
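
Proposition 2.9.1 can be verified exactly for small $n$ by enumerating rank vectors. The sketch below assumes, as the proof does, that the two rank vectors are independent and uniform over all permutations; the function names are hypothetical and exact rational arithmetic is used so that the two distributions can be compared with `==`.

```python
from collections import Counter
from fractions import Fraction
from itertools import permutations


def distribution_of_V(n: int) -> dict:
    """Exact law of (I(R_1i = R_2i))_i when R_1 and R_2 are independent uniform permutations."""
    perms = list(permutations(range(1, n + 1)))
    counts = Counter(
        tuple(int(a == b) for a, b in zip(r1, r2)) for r1 in perms for r2 in perms
    )
    total = len(perms) ** 2
    return {vec: Fraction(c, total) for vec, c in counts.items()}


def distribution_of_Delta(n: int) -> dict:
    """Exact law of (I(R_1i = i))_i when R_1 is a uniform permutation."""
    perms = list(permutations(range(1, n + 1)))
    counts = Counter(
        tuple(int(r1[i] == i + 1) for i in range(n)) for r1 in perms
    )
    return {vec: Fraction(c, len(perms)) for vec, c in counts.items()}


if __name__ == "__main__":
    for n in range(1, 5):
        print(n, distribution_of_V(n) == distribution_of_Delta(n))
```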

It readily follows from Proposition 2.9.1 that, in the independence
case,

    $\sum_{i=1}^{n} V_{ni} \stackrel{d}{=} \sum_{i=1}^{n} \Delta_{ni}$    (2.9.13)

In view of (2.9.13), if we let $Z_n = \sum_{i=1}^{n} \Delta_{ni}$, then the exact as well
as asymptotic distributions of $N(\varphi^*) = \sum_{i=1}^{n} V_{ni}$ can be derived by
studying $Z_n$, which is the same as the number of matches in the
card-matching problem. As stated in Proposition 2.3.1, the asymptotic
distribution of $Z_n$ is Poisson with mean 1. We now present another proof
of this well known result. The novel part of our proof is that we
establish certain dependence properties of $\Delta_{n1}, \ldots, \Delta_{nn}$ and consequently
derive the limiting distribution by using only the first two moments
of $Z_n$.

Our program can be stated as below:

(i)  Show that the $\Delta_{ni}$'s have a certain positive dependence structure,

(ii) Invoke a theorem due to Newman (1982) to arrive at the Poisson
convergence of $N$ in the independence case.

We start with the definitions of some concepts of dependence of
random variables.



Definition 2.9.1 (Lehmann, 1966): $X_1$ and $X_2$ are said to be positive
quadrant dependent (PQD) iff

    $P(X_1 > x_1, X_2 > x_2) \geq P(X_1 > x_1)\, P(X_2 > x_2)$, $\forall\, x_1, x_2 \in R$.    (2.9.14)

Definition 2.9.2 (Newman, 1982): $X_1, \ldots, X_n$ are said to be linearly
positive quadrant dependent (LPQD) iff for any disjoint subsets $A$, $B$
of $\{1,2,\ldots,n\}$ and positive constants $a_1, \ldots, a_n$,

    $\sum_{k \in A} a_k X_k$ and $\sum_{k \in B} a_k X_k$ are PQD.    (2.9.15)

Definition 2.9.3 (Esary, Proschan, Walkup, 1967): $X_1, \ldots, X_n$ are
said to be associated iff for every choice of functions
$f_1(x_1, \ldots, x_n)$ and $f_2(x_1, \ldots, x_n)$, which are monotonic increasing
in each argument,

    $\mathrm{cov}(f_1(X_1, \ldots, X_n), f_2(X_1, \ldots, X_n)) \geq 0$,    (2.9.16)

provided $f_1(X_1, \ldots, X_n)$ and $f_2(X_1, \ldots, X_n)$ have finite variance.

It is well known that association is a stronger property than the
LPQD property of $n$ random variables $X_1, \ldots, X_n$. We will now
establish that $\Delta_{n1}, \ldots, \Delta_{nn}$ in (2.9.5) possess a weaker version of
the LPQD property.

Lemma 2.9.1: For $k = 1,2,\ldots,n-1$,

    $\sum_{i=1}^{k} \Delta_{ni}$ and $\Delta_{nn}$ are PQD.    (2.9.17)

Proof: Fix $k = 1,2,\ldots,n-1$. Then, using (2.9.14), we see that
$\sum_{i=1}^{k} \Delta_{ni}$ and $\Delta_{nn}$ are PQD if

    $P(\sum_{i=1}^{k} \Delta_{ni} > x_1, \Delta_{nn} > x_2) \geq P(\sum_{i=1}^{k} \Delta_{ni} > x_1)\, P(\Delta_{nn} > x_2)$, $\forall\, x_1, x_2 \in R$    (2.9.18)

Because the $\Delta_{ni}$'s are binary random variables we obtain

    $P(\Delta_{nn} > x_2) = \begin{cases} 1 & \text{if } x_2 < 0 \\ 0 & \text{if } x_2 \geq 1 \end{cases}$    (2.9.19)

It is clear from (2.9.19) that (2.9.18) holds for any $x_1$, provided
$x_2 < 0$ or $x_2 \geq 1$. Hence, it suffices to show (2.9.18) for
$0 \leq x_2 < 1$. However, if $0 \leq x_2 < 1$, then $(\Delta_{nn} > x_2) = (\Delta_{nn} = 1)$.
It therefore remains to be shown that

    $P(\sum_{i=1}^{k} \Delta_{ni} \geq \ell, \Delta_{nn} = 1) \geq P(\sum_{i=1}^{k} \Delta_{ni} \geq \ell)\, P(\Delta_{nn} = 1)$, $\forall\, \ell = 0,1,\ldots,k$.    (2.9.20)

By definition of $\Delta_{ni}$,

    $P(\Delta_{ni} = 1) = P(R_{1i} = i) = \frac{1}{n}$    (2.9.21)

and $P(\Delta_{ni} = 0) = 1 - \frac{1}{n}$.

Writing $P(\sum_{i=1}^{k} \Delta_{ni} \geq \ell)$ in the form

    $P(\sum_{i=1}^{k} \Delta_{ni} \geq \ell, \Delta_{nn} = 0) + P(\sum_{i=1}^{k} \Delta_{ni} \geq \ell, \Delta_{nn} = 1)$

and using (2.9.21) we can rewrite (2.9.20) in a more useful form:

    $P(\sum_{i=1}^{k} \Delta_{ni} \geq \ell \mid \Delta_{nn} = 0) \leq P(\sum_{i=1}^{k} \Delta_{ni} \geq \ell \mid \Delta_{nn} = 1)$, $\ell = 0, \ldots, k$.    (2.9.22)

Note that, in (2.9.22), $k$ is a fixed integer. For a given $k$,
we now fix the value of $\ell$ and proceed to establish the inequality
in (2.9.22) by means of a combinatorial argument.

It is clear that we can express the event $(\Delta_{nn} = 0)$ or
$(R_{1n} \neq n)$ as $\bigcup_{\alpha=1}^{n-1} (R_{1n} = \alpha)$. Hence we can write,

    $(\sum_{i=1}^{k} \Delta_{ni} \geq \ell, \Delta_{nn} = 0) = \bigcup_{\alpha=1}^{n-1} J_\alpha$    (2.9.23)

where

    $J_\alpha = (\sum_{i=1}^{k} \Delta_{ni} \geq \ell, R_{1n} = \alpha)$, $\alpha = 1,2,\ldots,n-1$    (2.9.24)

Observe that, in (2.9.24), the $J_\alpha$'s are mutually disjoint measurable
subsets of $\Phi$. Let us now fix $\alpha = 1,2,\ldots,n-1$ as well. Then,
any permutation $\varphi$ in $J_\alpha$ satisfies $\varphi(n) = \alpha$ and $(\varphi(1), \ldots, \varphi(n-1))$
is an arrangement of the integers $1,2,\ldots,\alpha-1,\alpha+1,\ldots,n$ producing
at least $\ell$ matches of the type $\varphi(i) = i$ in the positions
$i = 1,2,\ldots,k$. On the other hand, any permutation $\varphi$ in
$(\sum_{i=1}^{k} \Delta_{ni} \geq \ell, \Delta_{nn} = 1)$ satisfies $\varphi(n) = n$ and
$(\varphi(1), \ldots, \varphi(n-1))$ is an arrangement of the integers $1,2,\ldots,n-1$
yielding at least $\ell$ matches such as $\varphi(i) = i$ in the positions
$i = 1,2,\ldots,k$. Because $\alpha \neq n$, it is clear that

    $\#(J_\alpha) \leq \#(\sum_{i=1}^{k} \Delta_{ni} \geq \ell, \Delta_{nn} = 1)$,    (2.9.25)

where $\#(A)$ denotes the cardinality of the set $A$.

Since $\alpha$, $k$ and $\ell$ were arbitrary choices, we get from (2.9.23),

    $\#(\sum_{i=1}^{k} \Delta_{ni} \geq \ell, \Delta_{nn} = 0) \leq (n-1)\, \#(\sum_{i=1}^{k} \Delta_{ni} \geq \ell, \Delta_{nn} = 1)$,
    $k = 1,2,\ldots,n-1$; $\ell = 0,\ldots,k$    (2.9.26)

Since $\mathbf{R}_1$ is discrete uniform on $\Phi$ it follows from (2.9.26) that

    $P(\sum_{i=1}^{k} \Delta_{ni} \geq \ell, \Delta_{nn} = 0) \leq P(\sum_{i=1}^{k} \Delta_{ni} \geq \ell, \Delta_{nn} = 1) \cdot (n-1)$    (2.9.27)

Multiplying both sides of the inequality in (2.9.27) by $n/(n-1)$ and
using (2.9.21), which gives $P(\Delta_{nn} = 0) = (n-1)/n$ and
$P(\Delta_{nn} = 1) = 1/n$, we establish (2.9.22), which implies that
(2.9.20) holds. □
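
The inequality (2.9.20) can be checked exhaustively for small $n$. The sketch below enumerates all permutations and tests every pair $(k, \ell)$ with exact rational arithmetic; the function name is hypothetical and the check is an illustration, not a replacement for the combinatorial argument.

```python
from fractions import Fraction
from itertools import permutations


def inequality_2_9_20_holds(n: int) -> bool:
    """Check P(sum_{i<=k} Delta_i >= l, Delta_n = 1) >= P(sum >= l) P(Delta_n = 1) for all k, l."""
    perms = list(permutations(range(1, n + 1)))
    total = len(perms)
    p_last = Fraction(sum(1 for p in perms if p[n - 1] == n), total)
    for k in range(1, n):
        for ell in range(0, k + 1):
            # Permutations with at least ell fixed points among the first k positions.
            hits = [p for p in perms if sum(p[i] == i + 1 for i in range(k)) >= ell]
            p_sum = Fraction(len(hits), total)
            p_joint = Fraction(sum(1 for p in hits if p[n - 1] == n), total)
            if p_joint < p_sum * p_last:
                return False
    return True


if __name__ == "__main__":
    for n in range(2, 7):
        print(n, inequality_2_9_20_holds(n))
```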

We now state two useful results due to Newman.

Lemma 2.9.2 (Newman, 1982): If $X_1$ and $X_2$ are PQD, then

    $|E(\exp(irX_1 + isX_2)) - E(\exp(irX_1))\, E(\exp(isX_2))| \leq |rs|\, \mathrm{cov}(X_1, X_2)$ for all $r, s \in R$    (2.9.28)
□

Lemma 2.9.3 (Newman, 1982): Suppose that $X_1, \ldots, X_n$ are LPQD. Then

    $|\Psi_{X_1, \ldots, X_n}(r_1, \ldots, r_n) - \prod_{j=1}^{n} \Psi_{X_j}(r_j)| \leq \sum_{k=1}^{n} \sum_{\ell=1,\, k<\ell}^{n} |r_k r_\ell|\, \mathrm{cov}(X_k, X_\ell)$, $\forall\, r_1, \ldots, r_n \in R$,    (2.9.29)

where the $\Psi$'s are given by

    $\Psi_{X_1, \ldots, X_n}(r_1, \ldots, r_n) = E(\exp(i \sum_{j=1}^{n} r_j X_j))$

    $\Psi_{X_j}(r_j) = E(\exp(i r_j X_j))$, $j = 1,2,\ldots,n$.
□


Suppose now that we choose the arguments $r_1, \ldots, r_n$ in (2.9.29)
equal to an arbitrary real number $r$, say. Assume further that
$X_1, \ldots, X_n$ are exchangeable random variables so that they have a
common characteristic function, namely $\Psi_{X_1}(r)$, and that the covariance
between any pair of the $X_i$'s is equal to $\mathrm{cov}(X_1, X_2)$. It follows from
(2.9.29) that

    $|\Psi_{\sum X_i}(r) - \Psi_{X_1}^{n}(r)| \leq \frac{n(n-1)}{2}\, r^2\, \mathrm{cov}(X_1, X_2)$    (2.9.30)

This estimate for approximating the characteristic function of $\sum_{i=1}^{n} X_i$
by the product of the marginal characteristic functions of the $X_i$'s
depends on the fact that $X_1, \ldots, X_n$ are LPQD. We now use Lemma
2.9.2 and show that, with regard to the variables $\Delta_{n1}, \ldots, \Delta_{nn}$,
an estimate similar to (2.9.30) can be obtained under the weaker
version of the LPQD property which is given by (2.9.17).

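Lemma 2.9.4: Let the $\Delta_{ni}$'s be the Bernoulli variables in (2.9.5) and
let $Z_n = \sum_{i=1}^{n} \Delta_{ni}$. Then,

    $|\Psi_{Z_n}(r) - \Psi_{\Delta_{n1}}^{n}(r)| \leq \frac{n(n-1)}{2}\, r^2\, \mathrm{cov}(\Delta_{n1}, \Delta_{n2})$, $\forall\, n \geq 2$, $r \in R$,    (2.9.31)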


Proof: The exchangeability of $\Delta_{n1}, \ldots, \Delta_{nn}$ was established in
Proposition 2.9.1. Hence, we obtain

    $\mathrm{cov}(\Delta_{ni}, \Delta_{nj}) = \mathrm{cov}(\Delta_{n1}, \Delta_{n2})$, $\forall\, i \neq j$    (2.9.32)

    $\Psi_{\Delta_{nj}}(r) = \Psi_{\Delta_{n1}}(r)$, $\forall\, j$,    (2.9.33)

Note also the well-known property that

    $|\Psi_{\Delta_{nj}}(r)| \leq 1$, $\forall\, j$ and $\forall\, r$    (2.9.34)

From Lemma 2.9.1, we have

    $\sum_{i=1}^{k} \Delta_{ni}$ and $\Delta_{nn}$ are PQD, $\forall\, k = 1,2,\ldots,n-1$

In view of the exchangeability of $\Delta_{n1}, \ldots, \Delta_{nn}$, we can restate this
property of the $\Delta_{ni}$'s as follows:

Let $A$ and $B$ be non-empty disjoint subsets of $\{1,2,\ldots,n\}$ such
that $B$ is a singleton. Then

    $\sum_{i \in A} \Delta_{ni}$ and $\sum_{i \in B} \Delta_{ni}$ are PQD    (2.9.35)

Fix $n \geq 2$ and consider the following finite sequence of statements:

    $|\Psi_{\sum_{i=1}^{m} \Delta_{ni}}(r) - \Psi_{\Delta_{n1}}^{m}(r)| \leq \frac{m(m-1)}{2}\, |r|^2\, \mathrm{cov}(\Delta_{n1}, \Delta_{n2})$, $\forall\, m = 2,3,\ldots,n$    (2.9.36)

Note that (2.9.31) is obtained from (2.9.36) by letting $m = n$. We
shall now establish (2.9.36) by induction on $m$.

By choosing $A = \{1\}$, $B = \{2\}$ in (2.9.35), we find that $\Delta_{n1}$ and
$\Delta_{n2}$ are PQD. Then Lemma 2.9.2 readily implies that (2.9.36) holds for
$m = 2$. Now, let us assume that (2.9.36) holds for $m = 2,3,\ldots,(n-1)$.
Splitting $\sum_{i=1}^{n} \Delta_{ni}$ as the sum of $\sum_{i=1}^{n-1} \Delta_{ni}$ and $\Delta_{nn}$, we infer the PQD
property of $\sum_{i=1}^{n-1} \Delta_{ni}$ and $\Delta_{nn}$ from (2.9.35). Hence we obtain again
from Lemma 2.9.2 and (2.9.32)

    $|\Psi_{\sum_{i=1}^{n} \Delta_{ni}}(r) - \Psi_{\sum_{i=1}^{n-1} \Delta_{ni}}(r) \cdot \Psi_{\Delta_{nn}}(r)| \leq |r|^2\, \mathrm{cov}(\sum_{i=1}^{n-1} \Delta_{ni}, \Delta_{nn})$

    $= |r|^2 (n-1)\, \mathrm{cov}(\Delta_{n1}, \Delta_{n2})$    (2.9.37)

Now, we shall invoke the induction hypothesis that (2.9.36) holds for
$m = n-1$. Using (2.9.33) to (2.9.37) we finally establish (2.9.36)
for $m = n$ as follows:

    $|\Psi_{\sum_{i=1}^{n} \Delta_{ni}}(r) - \Psi_{\Delta_{n1}}^{n}(r)|$

    $\leq |\Psi_{\sum_{i=1}^{n} \Delta_{ni}}(r) - \Psi_{\sum_{i=1}^{n-1} \Delta_{ni}}(r) \cdot \Psi_{\Delta_{nn}}(r)| + |\Psi_{\sum_{i=1}^{n-1} \Delta_{ni}}(r)\, \Psi_{\Delta_{nn}}(r) - \Psi_{\Delta_{n1}}^{n}(r)|$

    $\leq |r|^2 (n-1)\, \mathrm{cov}(\Delta_{n1}, \Delta_{n2}) + |\Psi_{\sum_{i=1}^{n-1} \Delta_{ni}}(r) - \Psi_{\Delta_{n1}}^{n-1}(r)|$

    $\leq |r|^2 (n-1)\, \mathrm{cov}(\Delta_{n1}, \Delta_{n2}) + |r|^2 \cdot \frac{(n-1)(n-2)}{2}\, \mathrm{cov}(\Delta_{n1}, \Delta_{n2})$

    $= |r|^2\, \mathrm{cov}(\Delta_{n1}, \Delta_{n2})\, (n-1)\left[1 + \frac{n-2}{2}\right]$

    $= \frac{n(n-1)}{2}\, |r|^2\, \mathrm{cov}(\Delta_{n1}, \Delta_{n2})$    (2.9.38)

The proof of (2.9.36) is complete by our inductive argument and
(2.9.31) follows from (2.9.38). □
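
The bound (2.9.31) can be examined numerically. The sketch below relies on two standard facts that are stated here as assumptions of the example rather than taken from this section: the number of fixed points of a uniform random permutation has pmf $P(Z_n = k) = \frac{1}{k!} \sum_{j=0}^{n-k} (-1)^j / j!$, and $\mathrm{cov}(\Delta_{n1}, \Delta_{n2}) = 1/(n^2(n-1))$. The function names are hypothetical.

```python
import cmath
import math


def fixed_point_pmf(n: int) -> list:
    """pmf of the number of fixed points of a uniform random permutation of n items."""
    return [
        sum((-1) ** j / math.factorial(j) for j in range(n - k + 1)) / math.factorial(k)
        for k in range(n + 1)
    ]


def cf_gap_and_bound(n: int, r: float) -> tuple:
    """Return (left side, right side) of inequality (2.9.31) for given n >= 2 and real r."""
    pmf = fixed_point_pmf(n)
    cf_sum = sum(p * cmath.exp(1j * r * k) for k, p in enumerate(pmf))
    cf_marginal = 1.0 + (cmath.exp(1j * r) - 1.0) / n  # c.f. of a Bernoulli(1/n) variable
    gap = abs(cf_sum - cf_marginal ** n)
    cov = 1.0 / (n * n * (n - 1))
    bound = n * (n - 1) / 2.0 * r * r * cov  # equals r^2 / (2n)
    return gap, bound


if __name__ == "__main__":
    for n in (2, 5, 10, 50):
        gap, bound = cf_gap_and_bound(n, 1.0)
        print(n, gap, bound)
```

Since the right side simplifies to $r^2/(2n)$, it tends to 0 as $n$ grows for each fixed $r$.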




Our preparations so far in this section are adequate for the
purpose of establishing the Poisson convergence of $N$ in the
independence case.

Theorem 2.9.1: Let $T$ and $U$ be independent random variables. Let
the number of correct matches, $N$, be given by (2.9.1). Then

    $N \stackrel{d}{\rightarrow}$ Poisson(1), as $n \rightarrow \infty$    (2.9.39)

Proof: We obtain from (2.9.13)

    $N \stackrel{d}{=} Z_n$,

where, using the exchangeability of the $\Delta_{ni}$'s, we obtain

    $\mathrm{cov}(\Delta_{n1}, \Delta_{n2}) = P(R_{11} = 1, R_{12} = 2) - [P(R_{11} = 1)]^2$    (2.9.40)

Since $P(R_{11} = 1, R_{12} = 2) = 1/n(n-1)$, it follows that

    $n(n-1)\, \mathrm{cov}(\Delta_{n1}, \Delta_{n2}) = \frac{1}{n}$, $\forall\, n \geq 2$,

and therefore

    $n(n-1)\, \mathrm{cov}(\Delta_{n1}, \Delta_{n2}) = o(1)$ as $n \rightarrow \infty$    (2.9.41)

The proof of (2.9.39) consists of showing that the characteristic
function of $Z_n$ converges to the characteristic function of the
Poisson distribution with mean 1. In other words, we shall show that

    $\Psi_{Z_n}(r) \rightarrow \exp(\exp(ir) - 1)$, $\forall\, r \in R$ as $n \rightarrow \infty$    (2.9.42)

To this end, Lemma 2.9.4 gives the following estimate of the
difference between the characteristic functions in (2.9.42):

    $|\Psi_{Z_n}(r) - \exp(\exp(ir) - 1)|$

    $\leq |\Psi_{Z_n}(r) - \Psi_{\Delta_{n1}}^{n}(r)| + |\Psi_{\Delta_{n1}}^{n}(r) - \exp(\exp(ir) - 1)|$

    $\leq \frac{n(n-1)}{2}\, |r|^2\, \mathrm{cov}(\Delta_{n1}, \Delta_{n2}) + |\Psi_{\Delta_{n1}}^{n}(r) - \exp(\exp(ir) - 1)|$    (2.9.43)

Now, using the distribution of $\Delta_{n1}$ given by (2.9.21) we get

    $\Psi_{\Delta_{n1}}^{n}(r) = [1 + \frac{1}{n}(\exp(ir) - 1)]^n$.

Clearly,

    $\Psi_{\Delta_{n1}}^{n}(r) \rightarrow \exp(\exp(ir) - 1)$, $\forall\, r \in R$, as $n \rightarrow \infty$    (2.9.44)

It readily follows from (2.9.41), (2.9.43) and (2.9.44) that (2.9.42)
holds. Hence we obtain

    $Z_n \stackrel{d}{\rightarrow}$ Poisson(1)    (2.9.45)

which is equivalent to (2.9.39). □
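
How quickly the card-matching distribution approaches Poisson(1) can be illustrated numerically. The sketch below computes the total variation distance between the two laws, using the standard fixed-point pmf $P(Z_n = k) = \frac{1}{k!} \sum_{j=0}^{n-k} (-1)^j / j!$ as an assumption of the example; the function names are hypothetical.

```python
import math


def fixed_point_pmf(n: int) -> list:
    """pmf of the number of fixed points of a uniform random permutation of n items."""
    return [
        sum((-1) ** j / math.factorial(j) for j in range(n - k + 1)) / math.factorial(k)
        for k in range(n + 1)
    ]


def tv_distance_to_poisson1(n: int) -> float:
    """Total variation distance between the law of Z_n and the Poisson(1) law."""
    pmf = fixed_point_pmf(n)
    poisson = [math.exp(-1.0) / math.factorial(k) for k in range(n + 1)]
    head = sum(abs(p - q) for p, q in zip(pmf, poisson))
    tail = 1.0 - sum(poisson)  # Poisson mass above n, where Z_n has no mass
    return 0.5 * (head + tail)


if __name__ == "__main__":
    for n in (2, 5, 10, 15):
        print(n, tv_distance_to_poisson1(n))
```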

We now assume that the broken random sample comes from a
population in which $T$ and $U$ are dependent random variables. It
should be noted that extensions of some of the techniques used in
the proof of the Poisson convergence in the independence case to the
dependence case are not available at this time. Specifically, no
proof of the counterpart of (2.9.17), namely

    $\sum_{i=1}^{k} V_{ni}$ and $V_{nn}$ are PQD $\forall\, k = 1,2,\ldots,n-1$, $\forall\, n \geq 2$    (2.9.46)

is known. However, direct verification of the association of
$V_{n1}, \ldots, V_{nn}$ has been carried out for $n = 2,3,4$ when $T$ and $U$ have the
Morgenstern distribution given by (2.6.16). Since association of
random variables is a much stronger dependence structure than
(2.9.46), it is natural to conjecture that Lemma 2.9.1 holds even
when $T$ and $U$ are dependent.

In the absence of a valid proof of Lemma 2.9.1 in the dependence
case, we need extra conditions on the distribution of $T$ and $U$
in order to derive the Poisson convergence of $N$. The following lemma
will be useful in deriving the main result of this section.


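Lemma 2.9.5: For a fixed $d$, let $\mathbf{L}_n = \mathbf{S}_n / n$ and $\mathbf{L} = (L_1, \ldots, L_d)'$,
where $\mathbf{S}_n$ and $\mathbf{L}$ are defined in Section 2.2. Then,

    $\mathbf{L}_n \stackrel{a.s.}{\rightarrow} \mathbf{L}$, as $n \rightarrow \infty$    (2.9.47)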


Proof: Fix $d \geq 1$. It is clear from the definitions of $\boldsymbol{\xi}_k$ in
(2.2.10) and the sigma-field $\mathcal{A}_d$ in Section 2.2 that the infinite
sequence

    $\boldsymbol{\xi}_{d+1}, \boldsymbol{\xi}_{d+2}, \ldots$

of $d$-dimensional vectors are conditionally i.i.d. given $\mathcal{A}_d$. Hence,
using the Strong Law of Large Numbers for exchangeable sequences
(Chow and Teicher, p. 223) we get

    $\frac{1}{n} \sum_{k=d+1}^{n} \boldsymbol{\xi}_k \stackrel{a.s.}{\rightarrow} E(\boldsymbol{\xi}_{d+1} \mid \mathcal{A}_d)$    (2.9.48)

In order to evaluate the limiting conditional expectation in
(2.9.48), note first that, for $j = 1,2,\ldots,d$, $T_j$ and $U_j$ are
uniform random variables. Now,

    $E(\xi_{d+1,j} \mid T_j = t_j, U_j = u_j)$

    $= P(t_j - T_{d+1} \geq 0) - P(u_j - U_{d+1} \geq 0)$

    $= P(T_{d+1} \leq t_j) - P(U_{d+1} \leq u_j)$

    $= t_j - u_j$,

so that $E(\xi_{d+1,j} \mid \mathcal{A}_d) = T_j - U_j = L_j$.    (2.9.49)

Therefore, it follows from the definition of $\boldsymbol{\xi}_{d+1}$ in (2.2.10) and
(2.9.49)

    $E(\boldsymbol{\xi}_{d+1} \mid \mathcal{A}_d) = (L_1, \ldots, L_d)' = \mathbf{L}$    (2.9.50)

Hence, (2.9.48) and (2.9.50) imply that

    $\frac{1}{n} \sum_{k=d+1}^{n} \boldsymbol{\xi}_k \stackrel{a.s.}{\rightarrow} \mathbf{L}$, as $n \rightarrow \infty$    (2.9.51)

Also, $d$ being a fixed integer, we have

    $\frac{1}{n} \sum_{k=1}^{d} \boldsymbol{\xi}_k \stackrel{a.s.}{\rightarrow} \mathbf{0}$    (2.9.52)

Since,

    $\mathbf{L}_n = \frac{1}{n} \sum_{k=1}^{n} \boldsymbol{\xi}_k$,

the lemma follows from (2.9.51) and (2.9.52). □

The following sufficient conditions will be used to prove the next
theorem.

Assumptions: In the notations of Section 2.2, let

    (a) $\lambda < \infty$    (2.9.53)

    (b) $\int_{-\infty}^{\infty} |\Psi_L(\theta)|\, d\theta < \infty$    (2.9.54)


and  (c)  P(H'“  <  t)  =  0(t  )  as  t  -♦  «*,  V  d  >  1 

a  — 


(2.9.55) 


Theorem 2.9.2: If Assumptions (2.9.53) to (2.9.55) hold, then

    $N \stackrel{d}{\rightarrow}$ Poisson($\lambda$) as $n \rightarrow \infty$    (2.9.56)

Proof: The proof of (2.9.56) consists in showing that the factorial
moments of $N$ converge to those of the Poisson distribution with mean
$\lambda$, in other words,

    $E(N^{(d)}) \rightarrow \lambda^d$, $\forall\, d = 1,2,\ldots$    (2.9.57)
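
The factorial-moment criterion in (2.9.57) can be illustrated in the independence case, where $\lambda = 1$ and $N$ has the card-matching distribution. The sketch below computes $E(N^{(d)})$ exactly, assuming the standard fixed-point pmf $P(Z_n = k) = \frac{1}{k!} \sum_{j=0}^{n-k} (-1)^j / j!$; in that special case the $d$th factorial moment already equals $\lambda^d = 1$ for every $d \leq n$. The function names are hypothetical.

```python
from fractions import Fraction
from math import factorial


def fixed_point_pmf_exact(n: int) -> list:
    """Exact pmf of the number of fixed points of a uniform random permutation of n items."""
    return [
        sum(Fraction((-1) ** j, factorial(j)) for j in range(n - k + 1)) / factorial(k)
        for k in range(n + 1)
    ]


def factorial_moment(pmf: list, d: int) -> Fraction:
    """E[N (N-1) ... (N-d+1)] for a pmf supported on 0, 1, ..., len(pmf) - 1."""
    total = Fraction(0)
    for k, p in enumerate(pmf):
        falling = 1
        for j in range(d):
            falling *= (k - j)  # becomes 0 as soon as j reaches k
        total += p * falling
    return total


if __name__ == "__main__":
    pmf = fixed_point_pmf_exact(8)
    for d in range(1, 10):
        print(d, factorial_moment(pmf, d))
```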


By the Fourier inversion theorem,

    $P(\mathbf{S}_n = \mathbf{0}) = (2\pi)^{-d} \int_{-\pi}^{\pi} \cdots \int_{-\pi}^{\pi} \Psi_{\mathbf{S}_n}(\boldsymbol{\theta})\, d\boldsymbol{\theta}$,    (2.9.58)

where $\Psi_{\mathbf{S}_n}(\boldsymbol{\theta})$ is the characteristic function of the $d$-dimensional
random vector $\mathbf{S}_n$ defined in (2.2.7).

The Assumption (2.9.54) ensures that the Fourier inversion
theorem can be applied to the continuous random variable $L$. Noting
that $\lambda = \int_0^1 c(x,x)\,dx$ is the value of the density function of $L$ at 0,
we get

    $\lambda = g_L(0) = (2\pi)^{-1} \int_{-\infty}^{\infty} \Psi_L(t)\, dt$

Since $L_j = T_j - U_j$, $j = 1,2,\ldots,d$, are i.i.d. with their common density
function equal to $g_L(\cdot)$, it follows that

    $\lambda^d = (2\pi)^{-d} \int_{-\infty}^{\infty} \cdots \int_{-\infty}^{\infty} \Psi_{\mathbf{L}}(\boldsymbol{\theta})\, d\boldsymbol{\theta}$    (2.9.59)

Recalling the representation

    $N(\varphi^*) = \sum_{i=1}^{n} I(A_{ni})$

from Corollary 2.6.1, we obtain

    $E(N^{(d)}) = n^{(d)}\, P(A_{n1} A_{n2} \cdots A_{nd})$,

    $= n^{(d)}\, P(\mathbf{S}_n = \mathbf{0})$,    (2.9.60)

where $n^{(d)} = n(n-1) \cdots (n-d+1)$.


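For fixed $d$, it is clear that $n^{(d)} \sim n^d$ as $n \rightarrow \infty$. It therefore
follows from (2.9.60) that, in order to prove (2.9.57), it is
sufficient to show that

    $\lim_{n \rightarrow \infty} |\Delta(d,n)| = 0$,    (2.9.61)

where $\Delta(d,n) = n^d P(\mathbf{S}_n = \mathbf{0}) - \lambda^d$.

From (2.9.58) and (2.9.59), we obtain

    $\Delta(d,n) = n^d (2\pi)^{-d} \int_{-\pi}^{\pi} \cdots \int_{-\pi}^{\pi} \Psi_{\mathbf{S}_n}(\mathbf{u})\, d\mathbf{u} - (2\pi)^{-d} \int_{-\infty}^{\infty} \cdots \int_{-\infty}^{\infty} \Psi_{\mathbf{L}}(\boldsymbol{\theta})\, d\boldsymbol{\theta}$    (2.9.62)

On making the change of variables $\boldsymbol{\theta} = (nu_1, \ldots, nu_d)$ in the
first term of (2.9.62) and noting that
$\Psi_{\mathbf{S}_n}(\boldsymbol{\theta}/n) = \Psi_{\mathbf{L}_n}(\boldsymbol{\theta})$, we get

    $\Delta(d,n) = (2\pi)^{-d} \left[ \int_{-n\pi}^{n\pi} \cdots \int_{-n\pi}^{n\pi} \Psi_{\mathbf{L}_n}(\boldsymbol{\theta})\, d\boldsymbol{\theta} - \int_{-\infty}^{\infty} \cdots \int_{-\infty}^{\infty} \Psi_{\mathbf{L}}(\boldsymbol{\theta})\, d\boldsymbol{\theta} \right]$    (2.9.63)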


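For positive constants $\alpha$ and $\beta$, which will be determined
later, define four integrals as follows:

    (i)   $J_1 = -\int \cdots \int_{|\boldsymbol{\theta}| > \alpha} \Psi_{\mathbf{L}}(\boldsymbol{\theta})\, d\boldsymbol{\theta}$    (2.9.64)

    (ii)  $J_2(n) = \int \cdots \int_{|\boldsymbol{\theta}| \leq \alpha} [\Psi_{\mathbf{L}_n}(\boldsymbol{\theta}) - \Psi_{\mathbf{L}}(\boldsymbol{\theta})]\, d\boldsymbol{\theta}$    (2.9.65)

    (iii) $J_3(n) = \int \cdots \int_{\alpha < |\boldsymbol{\theta}| \leq \beta n} \Psi_{\mathbf{L}_n}(\boldsymbol{\theta})\, d\boldsymbol{\theta}$    (2.9.66)

    (iv)  $J_4(n) = \int \cdots \int_{\beta n < |\boldsymbol{\theta}| \leq \pi n} \Psi_{\mathbf{L}_n}(\boldsymbol{\theta})\, d\boldsymbol{\theta}$    (2.9.67)

It is easy to verify using these integrals and (2.9.62) that

    $\Delta(d,n) = (2\pi)^{-d} \sum_{k=1}^{4} J_k$    (2.9.68)

For appropriate choices of $\alpha$ and $\beta$, we will show that
$|J_k(n)| \rightarrow 0$ as $n \rightarrow \infty$, $k = 1,2,3,4$,
which will imply (2.9.61).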

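Let $c > 0$ be a fixed number. Then, assumption (2.9.53) and the expression (2.9.59) imply that $\psi_{\mathbf{L}}(\boldsymbol{\theta})$ is absolutely integrable on $R^d$. Therefore, we can find a large enough $\alpha$ such that

$$|J_1| \le \int \cdots \int_{|\boldsymbol{\theta}| > \alpha} |\psi_{\mathbf{L}}(\boldsymbol{\theta})|\,d\boldsymbol{\theta} < c/4. \qquad (2.9.69)$$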

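From Lemma 2.9.5, we have

$$\mathbf{S}_n/n \xrightarrow{\text{a.s.}} \mathbf{L},$$

which implies that (cf. Bhattacharya and Ranga Rao, 1976, p. 44)

$$\psi_{\mathbf{S}_n/n}(\boldsymbol{\theta}) \to \psi_{\mathbf{L}}(\boldsymbol{\theta}) \quad \text{as } n \to \infty,$$

the convergence being uniform on the compact subset $\{\boldsymbol{\theta} : \boldsymbol{\theta} \in R^d \text{ and } |\boldsymbol{\theta}| \le \alpha\}$.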

Hence, for the $\alpha$ chosen above, we can find $n_1$ such that, $\forall\, n \ge n_1$,

$$|J_2(n)| < c/4. \qquad (2.9.70)$$

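In order to show that $|J_3(n)| \to 0$, we transform $\boldsymbol{\theta}$ to $\mathbf{r} = \boldsymbol{\theta}/n$ in $J_3$ and obtain

$$J_3(n) = n^d \int \cdots \int_{\alpha/n < |\mathbf{r}| \le \beta} \psi_{\mathbf{S}_n}(\mathbf{r})\,d\mathbf{r}. \qquad (2.9.71)$$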


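Note that $\mathbf{S}_n = \sum_{i=1}^{n} \boldsymbol{\xi}_i$ is a lattice random vector so all its moments exist. Since the pairs $(T_i, U_i)$ are i.i.d., it follows from the definition of $\boldsymbol{\xi}$ in (2.2.10) that

$$E(\mathbf{S}_n) = \mathbf{0}. \qquad (2.9.72)$$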

It was argued in the proof of Lemma 2.9.5 that, for all $n > d$, $\boldsymbol{\xi}_{d+1}, \ldots, \boldsymbol{\xi}_n$ are conditionally i.i.d. given $A_d$ with mean

$$E(\boldsymbol{\xi}_j \mid A_d) = \mathbf{L}, \quad \forall\, j = d+1, \ldots, n.$$

It is easy to verify that the dispersion matrices $D(\boldsymbol{\xi}_j \mid A_d)$, $j = d+1, \ldots, n$, are positive definite. Moreover, for $j = 1, 2, \ldots, d$, $\boldsymbol{\xi}_j$ is degenerate given $A_d$ and

$$D(\mathbf{L}) = \sigma^2 I,$$

where $\sigma^2 = \mathrm{var}(T - U)$ and $I$ is the $d \times d$ identity matrix.
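The dispersion matrix of $\mathbf{S}_n$ is, for $n > d$,

$$D(\mathbf{S}_n) = D\Big(\sum_{i=1}^{n} \boldsymbol{\xi}_i\Big) = E\Big(D\Big(\sum_{i=1}^{n} \boldsymbol{\xi}_i \,\Big|\, A_d\Big)\Big) + D\Big(E\Big(\sum_{i=1}^{n} \boldsymbol{\xi}_i \,\Big|\, A_d\Big)\Big) = (n-d)\,E\,D(\boldsymbol{\xi}_{d+1} \mid A_d) + (n-d)^2 D(\mathbf{L}).$$

We finally conclude that

$$D(\mathbf{S}_n) - (n-d)^2 \sigma^2 I = (n-d)\,E\,D(\boldsymbol{\xi}_{d+1} \mid A_d) \qquad (2.9.73)$$

is positive definite.

As the second-order moments of $\mathbf{S}_n$ exist, we expand $\psi_{\mathbf{S}_n}(\mathbf{r})$ around $\mathbf{r} = \mathbf{0}$ and, using (2.9.72), obtain

$$\log \psi_{\mathbf{S}_n}(\mathbf{r}) = -\tfrac{1}{2}\,\mathbf{r}' D(\mathbf{S}_n)\,\mathbf{r} + o(\|\mathbf{r}\|^2), \quad \text{as } \|\mathbf{r}\| \to 0. \qquad (2.9.74)$$

In view of (2.9.73), we obtain

$$|\exp(\log \psi_{\mathbf{S}_n}(\mathbf{r}))| \le \exp\Big\{ -\frac{(n-d)^2}{2}\,\sigma^2 \|\mathbf{r}\|^2 + o(\|\mathbf{r}\|^2) \Big\}, \quad \text{as } \|\mathbf{r}\| \to 0. \qquad (2.9.75)$$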
Hence, there exists a constant $\beta > 0$ such that, for $n > d$,

$$|\psi_{\mathbf{S}_n}(\mathbf{r})| \le \exp\{ -\tfrac{1}{4} (n-d)^2 \sigma^2 \|\mathbf{r}\|^2 \}, \quad \forall\, \|\mathbf{r}\| \le \beta. \qquad (2.9.76)$$

Now, $\exists\, n_2$ such that $\forall\, n \ge n_2$, $\alpha/n < \beta$, so that we obtain, using (2.9.72) and (2.9.76),

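$$|J_3(n)| \le n^d \int \cdots \int_{\alpha/n < |\mathbf{r}| \le \beta} \exp\big( -\tfrac{1}{4} (n-d)^2 \sigma^2 \|\mathbf{r}\|^2 \big)\,d\mathbf{r} \le \int \cdots \int_{|\boldsymbol{\theta}| > \alpha} \exp\big( -\tfrac{1}{4} (1 - d/n)^2 \sigma^2 \|\boldsymbol{\theta}\|^2 \big)\,d\boldsymbol{\theta}. \qquad (2.9.77)$$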

It is clear that we can choose a large enough $\alpha$ in (2.9.77) such that, $\forall\, n \ge n_2$,

$$|J_3(n)| < c/4. \qquad (2.9.78)$$

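Finally, to show that $|J_4| \to 0$, we transform $\mathbf{u} = \boldsymbol{\theta}/n$ in (2.9.67) and obtain

$$|J_4(n)| \le n^d \int \cdots \int_{\beta < |\mathbf{u}| < \pi} |\psi_{\mathbf{S}_n}(\mathbf{u})|\,d\mathbf{u}. \qquad (2.9.79)$$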

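In view of the earlier remarks about the conditional distributions of $\boldsymbol{\xi}_1, \ldots, \boldsymbol{\xi}_n$ given $A_d$, we obtain, for $n > d$,

$$|\psi_{\mathbf{S}_n}(\mathbf{u})| \le E_{A_d} \big| \psi_{\boldsymbol{\xi}_{d+1} \mid w_1, \ldots, w_d}(\mathbf{u}) \big|^{\,n-d}, \qquad (2.9.80)$$

where $\psi_{\boldsymbol{\xi}_{d+1} \mid w_1, \ldots, w_d}$ is the characteristic function of $\boldsymbol{\xi}_{d+1}$ given $w_i = (T_i, U_i)$, $i = 1, 2, \ldots, d$. Since the characteristic function $\psi_{\boldsymbol{\xi}_{d+1} \mid w_1, \ldots, w_d}(\mathbf{u})$ is uniformly continuous on the compact set $\{\mathbf{u} : \beta \le |\mathbf{u}| \le \pi\}$ of $R^d$, it attains its maximum inside this set, say at $\mathbf{u} = \mathbf{u}^*$. Furthermore, $\psi_{\boldsymbol{\xi}_{d+1} \mid w_1, \ldots, w_d}$ has period $2\pi$ so that, for almost all realizations $(w_1, \ldots, w_d)$,

$$\sup_{\beta \le |\mathbf{u}| \le \pi} \big| \psi_{\boldsymbol{\xi}_{d+1} \mid w_1, \ldots, w_d}(\mathbf{u}) \big| < 1. \qquad (2.9.81)$$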


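Letting $Y^* = -\ln \big| \psi_{\boldsymbol{\xi}_{d+1} \mid w_1, \ldots, w_d}(\mathbf{u}^*) \big|$, we get from (2.9.79) and (2.9.80),

$$|J_4| \le n^d\, E_{A_d}\big( \exp(-(n-d) Y^*) \big) = n^d M_{Y^*}(n-d), \qquad (2.9.82)$$

where

$$M_{Y^*}(s) = \int_0^1 \cdots \int_0^1 \exp(-s Y^*) \prod_{j=1}^{d} dC(x_j, y_j) \qquad (2.9.83)$$

is the moment generating function of $Y^*$ with a real positive argument.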

Now, using the Abelian Theorem (cf. Widder (1941), p. 181), we obtain

$$\limsup_{t \to \infty} t^d M_{Y^*}(t) \le \limsup_{t \downarrow 0} \left[ \frac{P(Y^* \le t)}{t^d}\, \Gamma(d+1) \right]. \qquad (2.9.84)$$

By Assumption (2.9.55), the right-hand side of (2.9.84) is zero and it follows that

$$n^d M_{Y^*}(n-d) \to 0, \quad \text{as } n \to \infty,$$

which implies, in view of (2.9.82),

$$|J_4(n)| \to 0, \quad \text{as } n \to \infty. \qquad (2.9.85)$$

It follows from (2.9.69), (2.9.70), (2.9.78) and (2.9.85) that

$$\lim_{n \to \infty} |\Delta(d,n)| = 0.$$

The convergence of factorial moments in (2.9.57) follows immediately, which in turn implies the Poisson convergence in (2.9.56). $\Box$

The validity of Theorem 2.9.2 depends on whether the Assumptions (2.9.53) to (2.9.55) hold or not. We shall now give some examples in order to illustrate the fact that these Assumptions are not vacuous. We start with a discussion of (2.9.53).

For any copula $C(x,y)$ on $[0,1]^2$, one may define $\phi^2$ (possibly infinite) by the equation

$$\phi^2 + 1 = \int_0^1 \int_0^1 c^2(x,y)\,dx\,dy, \qquad (2.9.86)$$

where $c(x,y) = \partial^2 C(x,y)/\partial x\,\partial y$ is the Radon-Nikodym derivative of the joint distribution of $(T, U)$ with respect to the product measure of $T$ and $U$ (i.e., the independent case). $C(x,y)$ is a $\phi^2$-bounded distribution (with marginal uniform distribution) if $\phi^2 < +\infty$.

The class of $\phi^2$-bounded distributions is large, as is evident from the following general result (see Lancaster 1969, page 95).

Proposition 2.9.3: If $H(t,u)$ is a $\phi^2$-bounded bivariate distribution with marginal distributions $F(t)$ and $G(u)$, then complete sets of orthonormal functions $\{\xi_i(t)\}$ and $\{\eta_i(u)\}$, $i = 1, 2, \ldots$, can be defined on the marginal distributions such that

$$dH(t,u) = \Big[ 1 + \sum_{i=1}^{\infty} \rho_i\, \xi_i(t)\, \eta_i(u) \Big]\, dF(t)\, dG(u) \qquad (2.9.87)$$

and

$$\phi^2 = \sum_{i=1}^{\infty} \rho_i^2. \qquad (2.9.88)$$

It may be recalled from (2.6.12) that, when all $\rho_i > 0$ in the above canonical expansion of the joint distribution of $T$ and $U$, we say $T$ and $U$ are positive dependent by expansion (PDE). It follows from (2.9.87) that, when a copula $C(t,u)$ is $\phi^2$-bounded, $\lambda$ in (2.9.53) can be evaluated using the orthonormality of $\{\xi_i\}$ as

$$\lambda = \int_0^1 c(x,x)\,dx = 1 + \sum_{i=1}^{\infty} \rho_i. \qquad (2.9.89)$$

It follows from (2.9.88) and (2.9.89) that the finiteness of $\phi^2$ and that of $\lambda$ are related to each other. Specifically, since, $\forall\, i \ge 1$, the canonical correlations $\rho_i \le 1$, we obtain

$$\lambda < \infty \;\Rightarrow\; \phi^2 < \infty.$$


With regard to the Morgenstern distribution in (2.6.16), we obtain

$$\rho_i = \begin{cases} \alpha/3 & \text{if } i = 1, \\ 0 & \text{if } i > 1, \end{cases}$$

where $-1 < \alpha < 1$. We therefore have

$$\lambda = \int_0^1 c(x,x)\,dx = 1 + \frac{\alpha}{3},$$

which is finite. Similarly, in the bivariate normal distribution given by (2.6.1b),

$$\lambda = \frac{1}{1 - \rho}, \quad 0 < \rho < 1.$$

In view of these examples, assumption (2.9.53) is not vacuous.
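
The two values of $\lambda$ quoted above can be checked numerically. The following is a minimal sketch, not part of the original derivation: the function names `lam_morgenstern` and `lam_normal` and the midpoint-rule grids are assumptions made for illustration. For the normal case the integral $\int_0^1 c(x,x)\,dx$ is rewritten on the normal scale as $\int \varphi_2(z,z;\rho)/\varphi(z)\,dz$, where $\varphi_2$ and $\varphi$ denote the bivariate and univariate standard normal densities.

```python
import math


def lam_morgenstern(alpha, m=2000):
    """Midpoint-rule value of the integral of c(x, x) over [0, 1] for the
    Morgenstern density c(x, y) = 1 + alpha * (1 - 2x) * (1 - 2y)."""
    h = 1.0 / m
    total = 0.0
    for k in range(m):
        x = (k + 0.5) * h
        total += 1.0 + alpha * (1.0 - 2.0 * x) ** 2
    return total * h


def lam_normal(rho, half_width=40.0, m=20000):
    """Same integral for the bivariate normal copula, on the normal scale.

    phi2(z, z; rho) / phi(z) simplifies to
    exp(-z^2 (1 - rho) / (2 (1 + rho))) / sqrt(2 pi (1 - rho^2)).
    """
    h = 2.0 * half_width / m
    total = 0.0
    for k in range(m):
        z = -half_width + (k + 0.5) * h
        total += math.exp(-z * z * (1.0 - rho) / (2.0 * (1.0 + rho)))
    return total * h / math.sqrt(2.0 * math.pi * (1.0 - rho * rho))


if __name__ == "__main__":
    print(lam_morgenstern(0.6), 1.0 + 0.6 / 3.0)
    print(lam_normal(0.5), 1.0 / (1.0 - 0.5))
```

The quadrature values agree with $1 + \alpha/3$ and $1/(1-\rho)$ to within the discretization error of the grid; the truncation at `half_width` is adequate only for moderate $\rho$, since the integrand decays slowly as $\rho \to 1$.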


Bhattacharya and Ranga Rao (1976, pp. 189-192) give conditions that are equivalent to the assumption (2.9.54). We cite one here:

Let $G_L^{*m}$ denote the $m$th convolution of the distribution of $L = T - U$, where $m \ge 1$. If there exists an integer $m$ such that $G_L^{*m}$ has a bounded (almost everywhere) density, then the modulus of the characteristic function of $L$ is integrable on $(-\infty, \infty)$ (that is, assumption (2.9.54) is valid) and vice versa.


Another sufficient condition for absolute integrability of $\psi_L(\theta)$ is due to Bochner and Chandrasekar (1949). If there exists a bounded (almost everywhere) density $g_L(t)$ of $L = T - U$ and if its characteristic function $\psi_L(\theta)$ is real and nonnegative, then

$$\int_{-\infty}^{\infty} |\psi_L(\theta)|\,d\theta < \infty.$$


We illustrate the use of this sufficient (but not necessary) condition when $(T, U)$ has the Morgenstern PDF,

$$c(x,y) = 1 + \alpha (1 - 2x)(1 - 2y).$$

Clearly, as $|\alpha| \le 1$, $|x| \le 1$, $|y| \le 1$, $\exists$ a positive constant $k$ such that

$$c(x,y) \le k, \quad \forall\, (x,y) \in [0,1]^2.$$

Note that

$$g_L(t) = \int_{y=0}^{1-t} c(t+y, y)\,dy, \quad \forall\, t \ge 0.$$

By the symmetry of $c(x,y)$ in $x$ and $y$, it can be shown that $g_L(-t) = g_L(t)$, $\forall\, t \ge 0$.

Now, using the bound $k$ for $c(x,y)$, and the fact that $[-1,1]$ is the support of $L$, we get

$$g_L(t) \le k \int_0^{1-t} dy \le 2k < \infty.$$
Hence, it follows that the PDF of $L$ is (almost everywhere) bounded. We now show that $\psi_L(\theta)$ is real and nonnegative $\forall\, \alpha \ge 0$:

$$\psi_L(\theta) = E(e^{i(T-U)\theta}) = I_1 + \alpha I_2,$$

where

$$I_1 = \int_0^1 \int_0^1 e^{i(x-y)\theta}\,dx\,dy = |Z_1(\theta)|^2, \quad \text{with } Z_1(\theta) = \int_0^1 e^{ix\theta}\,dx,$$

and

$$I_2 = \int_0^1 \int_0^1 e^{i(x-y)\theta} (1-2x)(1-2y)\,dx\,dy = |Z_2(\theta)|^2, \quad \text{with } Z_2(\theta) = \int_0^1 e^{ix\theta} (1-2x)\,dx.$$

Hence, $\psi_L(\theta) = |Z_1(\theta)|^2 + \alpha |Z_2(\theta)|^2 \ge 0$ if $\alpha \ge 0$.
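
The factorization of the characteristic function of $L = T - U$ under the Morgenstern density can be checked numerically. The sketch below is illustrative only: the names `psi_L`, `z1`, `z2` and the midpoint grid are assumptions, not notation from the text. It evaluates the double integral for $E(e^{i(T-U)\theta})$ directly and compares it with $|Z_1(\theta)|^2 + \alpha |Z_2(\theta)|^2$.

```python
import cmath


def nodes(m):
    """Midpoints of m equal subintervals of [0, 1]."""
    return [(k + 0.5) / m for k in range(m)]


def z1(theta, m=100):
    """Midpoint approximation of the integral of exp(i x theta) over [0, 1]."""
    return sum(cmath.exp(1j * x * theta) for x in nodes(m)) / m


def z2(theta, m=100):
    """Midpoint approximation of the integral of exp(i x theta) (1 - 2x)."""
    return sum(cmath.exp(1j * x * theta) * (1 - 2 * x) for x in nodes(m)) / m


def psi_L(theta, alpha, m=100):
    """Direct double-sum approximation of E exp(i (T - U) theta) when (T, U)
    has density 1 + alpha (1 - 2x)(1 - 2y) on the unit square."""
    xs = nodes(m)
    total = 0j
    for x in xs:
        for y in xs:
            density = 1 + alpha * (1 - 2 * x) * (1 - 2 * y)
            total += cmath.exp(1j * (x - y) * theta) * density
    return total / (m * m)


if __name__ == "__main__":
    for theta in (0.5, 3.0, 10.0):
        direct = psi_L(theta, 0.5)
        factored = abs(z1(theta)) ** 2 + 0.5 * abs(z2(theta)) ** 2
        print(theta, direct.real, factored)
```

On a common grid the two expressions agree up to rounding, because the double sum factorizes in the same way as the double integral; the imaginary part of `psi_L` is zero up to rounding, and the value is nonnegative whenever `alpha` is nonnegative.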


Invoking Bochner's sufficient condition, we get $\int_{-\infty}^{\infty} |\psi_L(\theta)|\,d\theta < \infty$ if $\alpha \ge 0$. However, for all $\alpha$,

$$\int_{-\infty}^{\infty} |\psi_L(\theta)|\,d\theta \le \int_{-\infty}^{\infty} |Z_1(\theta)|^2\,d\theta + |\alpha| \int_{-\infty}^{\infty} |Z_2(\theta)|^2\,d\theta, \qquad (2.9.90)$$

with equality when $\alpha \ge 0$, so that the two integrals on the right-hand side must be finite when $\alpha > 0$. It follows that, even when $\alpha < 0$, $\int_{-\infty}^{\infty} |\psi_L(\theta)|\,d\theta < \infty$. We conclude that (2.9.54) is valid for any member of the Morgenstern family of densities. It may be remarked, in passing, that, in view of the generality of the conditions of Bhattacharya and Ranga Rao (1976) and Bochner and Chandrasekar (1949), (2.9.54) holds for many distributions of $(T, U)$.
Lastly, we discuss the validity of (2.9.55). To be specific, when $d = 1$, one can get the bound

lf?  ^  1  -  P0(1-P0>  ♦  8ln2(fl/2)  V  B  <  6  <  *,  w  -  (J> 

where $p_0 = p_0(w) = 1 - x - y + 2C(x,y)$.

Therefore, 

I J  <n,B)  |  <  I  J  n  „-<n-msinJB[P0U-P0n  ^ 

0  0 

Thus, $|J_4| \to 0$ as $n \to \infty$ if we show that $n H_\eta(n) \to 0$ as $n \to \infty$, where $H_\eta(s)$ is the Laplace transform of $\eta$. A sufficient condition for this to happen is

$$P(p_0 (1 - p_0) \le t) = o(t), \quad \text{as } t \to 0. \qquad (2.9.91)$$

Let $\delta(t)$ and $1 - \delta(t)$ be the roots of the equation

$$p_0 (1 - p_0) = t.$$

It suffices to show, as $t \to 0$,

$$P(p_0 \le \delta(t)) = o(t) \qquad (2.9.92)$$

and

$$P(p_0 \ge 1 - \delta(t)) = o(t). \qquad (2.9.93)$$

If $T$ and $U$ are independent, then the PDF of $p_0$ can be shown to be

$$g_{p_0}(x) = -\ln(|1 - 2x|)\, I_{[0,1]}(x),$$

so that (2.9.92) and (2.9.93) are valid when $C(x,y) = C_0(x,y)$, where $C_0(x,y) = xy$. Also, if $C(x,y) \ge xy$, then $p_0(C) \ge p_0(C_0)$, so that

$$P(p_0(C) \le \delta(t)) \le P(p_0(C_0) \le \delta(t)). \qquad (2.9.94)$$

Thus, using the exact calculations based on the independence case, it follows that

$$\forall\, C \ge xy, \quad P(p_0(C) \le \delta(t)) = o(t).$$

At this time, we are optimistically speculating that, when $T$ and $U$ are dependent, (2.9.93) is also true. We are yet to demonstrate that the assumption (2.9.55) is not vacuous for any $d > 1$.

After  we  derived  the  proof  of  Theorem  2.9.2,  we  discussed  the 
Poisson  convergence  problem  with  Professor  Persi  Diaconis,  who 
communicated  the  problem  to  Professor  Charles  Stein.  In  his  Neyman 
lecture at the IMS Annual (1984) meeting, Professor Stein outlined
an  alternative  proof  of  the  Poisson  convergence  using  his  well-known 
theorem  concerning  the  approximation  of  probabilities.  However,  we 
have  not  seen  any  rigorous  version  of  the  proof  yet. 
3. MERGING FILES OF DATA ON SIMILAR INDIVIDUALS

Problems of statistical matching were discussed in Chapter 2, where we assumed that the two micro-data files being matched consisted of the same individuals. Moreover, the files did not have any common matching variables. In Chapter 1, practical and legal reasons were cited for these assumptions not to hold in certain situations. Suppose, then, we have two files of data that pertain to similar individuals. Allowing for some matching variables to be observed for each unit in the two files, we seek to merge the files so that inference problems relating to the variables not present in the same file can be addressed. This scenario was labeled Case III in Section 1. In this chapter, we shall first review the existing literature on Case III, and then briefly discuss some alternatives to matching in certain models in which the non-matching variables are conditionally independent given the values of the matching variables. Finally, we will present the results of a Monte-Carlo study carried out to evaluate certain matching procedures relevant to Case III.
3.1 Kadane's Matching Strategies for
Multivariate Normal Models

Distance-based matching strategies were introduced in Section 1.6. The choice of distance measures in the matching methodology can be motivated using a model where the unobserved triplet W = (X,Y,Z) has a multivariate normal distribution. The set-up of the two files to be merged is as follows:

File 1 comprises a random sample of size $n_1$ on (X,Z), while File 2 consists of a random sample of size $n_2$ on (Y,Z). Furthermore, we expect very few or no records in the two files to correspond to the same individuals. Statistically, this means that, for all practical purposes, the two random samples are themselves independent. For this reason, we shall denote the sample data as follows.


(Base) File 1: $(X_i, Z_i)$, $i = 1, 2, \ldots, n_1$

(Supplementary) File 2: $(Y_j, Z_j)$, $j = n_1+1, \ldots, n_1+n_2$ (3.1.1)

Once finished, the matching process leads to more comprehensive synthetic files, namely

Synthetic File 1: $(X_i, Y_i^*, Z_i)$, $i = 1, 2, \ldots, n_1$

Synthetic File 2: $(X_j^*, Y_j, Z_j)$, $j = n_1+1, \ldots, n_1+n_2$ (3.1.2)


where $Y_i^*$ is an imputed value of Y that comes from the original File 2 and $X_j^*$ is an imputed value of X that is taken from the original File 1 by means of some matching strategy. We shall now review Kadane's (1978) development of the matching methodology for a multivariate normal model.

Suppose that W = (X,Y,Z) has a multivariate normal distribution with mean vector $(\mu_x, \mu_y, \mu_z)$ and variance-covariance matrix

$$\Sigma = \begin{bmatrix} \Sigma_{xx} & \Sigma_{xy} & \Sigma_{xz} \\ \Sigma_{yx} & \Sigma_{yy} & \Sigma_{yz} \\ \Sigma_{zx} & \Sigma_{zy} & \Sigma_{zz} \end{bmatrix}. \qquad (3.1.3)$$


The parameters $\Sigma_{xx}$, $\Sigma_{xz}$, $\Sigma_{yy}$, $\Sigma_{yz}$, $\Sigma_{zz}$ can all be estimated consistently using the marginal information on (X,Z) and (Y,Z) respectively in the two files. However, $\Sigma_{xy}$ is an unidentified parameter, because the joint likelihood of the data on (X,Z) and (Y,Z) is free of the matrix $\Sigma_{xy}$. In fact, in the domain in which $\Sigma_{xy}$ is such that the matrix

$$\begin{bmatrix} \Sigma_{xx} & \Sigma_{xy} \\ \Sigma_{yx} & \Sigma_{yy} \end{bmatrix}$$

is positive semidefinite, nothing is learned from the data about $\Sigma_{xy}$, except in a Bayesian framework, where $\Sigma_{xy}$, $\Sigma_{xz}$, $\Sigma_{yz}$ are, a priori, dependent. Even in this situation, the posterior distribution of $\Sigma_{xy}$ is updated only through $\Sigma_{xz}$ and $\Sigma_{yz}$.

Kadane's approach to merging File 1 and File 2 consists of the following steps:

(i) Start with an imputed value of $\Sigma_{xy}$ via some a priori distribution on the covariance matrix $\Sigma$.

(ii) Complete Files 1 and 2 by predicting the missing data, X or Y, using the marginal information in the files.

(iii) Match these "completed" files based on a distance measure between records of the two files.

(iv) Estimate parameters such as

$$\gamma = \int g(w)\,dF(w), \qquad (3.1.4)$$

using the synthetic file resulting from Step (iii), and repeat Steps (ii) through (iv) many times to find the sensitivity of the estimates to the imputed value of $\Sigma_{xy}$; finally, weight the results using the a priori distribution on $\Sigma$.

Some further details of the steps outlined above are as follows:

Suppose that an imputed value of $\Sigma_{xy}$ is available. Then we can assume that $\Sigma_{xy}$ is known and complete the two files by means of conditional expectations. Let $\Sigma_{ab.c}$, for any letters a, b and c, be given by

$$\Sigma_{ab.c} = \Sigma_{ab} - \Sigma_{ac} \Sigma_{cc}^{-1} \Sigma_{cb}.$$
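Then the predicted value $\hat{Y}$, say, of a missing Y in File 1 is given by

$$\hat{Y} = E(Y \mid X, Z) = \mu_y + \Sigma_{yx.z} \Sigma_{xx.z}^{-1} (X - \mu_x) + \Sigma_{yz.x} \Sigma_{zz.x}^{-1} (Z - \mu_z). \qquad (3.1.5)$$

Similarly, the predicted value, $\hat{X}$, say, of a missing X in File 2 is given by

$$\hat{X} = E(X \mid Y, Z) = \mu_x + \Sigma_{xy.z} \Sigma_{yy.z}^{-1} (Y - \mu_y) + \Sigma_{xz.y} \Sigma_{zz.y}^{-1} (Z - \mu_z). \qquad (3.1.6)$$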
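
The completion step can be illustrated in the simplest setting where X, Y and Z are all scalar, so that every $\Sigma_{ab.c}$ is a number. The sketch below is an illustration under that assumption; the helper names `symmetric_cov`, `partial_cov`, `predict_y` and `predict_x`, and the numerical covariances in the usage lines, are hypothetical and do not come from the text.

```python
def symmetric_cov(xx, yy, zz, xy, xz, yz):
    """Lookup of scalar covariances keyed by two-letter labels such as 'xy'."""
    s = {"xx": xx, "yy": yy, "zz": zz, "xy": xy, "xz": xz, "yz": yz}
    s.update({"yx": xy, "zx": xz, "zy": yz})
    return s


def partial_cov(s, a, b, c):
    """Sigma_{ab.c} = Sigma_ab - Sigma_ac * Sigma_cc^{-1} * Sigma_cb (scalars)."""
    return s[a + b] - s[a + c] * s[c + b] / s[c + c]


def predict_y(x, z, mu, s):
    """Predicted Y for a File 1 record (x, z), in the form of (3.1.5)."""
    b_x = partial_cov(s, "y", "x", "z") / partial_cov(s, "x", "x", "z")
    b_z = partial_cov(s, "y", "z", "x") / partial_cov(s, "z", "z", "x")
    return mu["y"] + b_x * (x - mu["x"]) + b_z * (z - mu["z"])


def predict_x(y, z, mu, s):
    """Predicted X for a File 2 record (y, z), in the form of (3.1.6)."""
    b_y = partial_cov(s, "x", "y", "z") / partial_cov(s, "y", "y", "z")
    b_z = partial_cov(s, "x", "z", "y") / partial_cov(s, "z", "z", "y")
    return mu["x"] + b_y * (y - mu["y"]) + b_z * (z - mu["z"])


if __name__ == "__main__":
    # Hypothetical unit variances with an imputed value 0.5 for Sigma_xy.
    s = symmetric_cov(1.0, 1.0, 1.0, xy=0.5, xz=0.6, yz=0.7)
    mu = {"x": 0.0, "y": 0.0, "z": 0.0}
    print(predict_y(1.0, 0.0, mu, s), predict_y(0.0, 1.0, mu, s))
```

With these hypothetical inputs the coefficients on X and Z in the prediction of Y are 0.125 and 0.625, which satisfy the usual normal equations for the regression of Y on (X, Z). In the matrix case the divisions become multiplications by matrix inverses.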

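Using (3.1.3), (3.1.5) and (3.1.6), it is now easy to show that $(X_i, \hat{Y}_i, Z_i)$, $i = 1, \ldots, n_1$, is multivariate normal with mean vector $(\mu_x, \mu_y, \mu_z)$ and variance-covariance matrix

$$\Omega_1 = \begin{bmatrix} \Sigma_{xx} & A_1' & \Sigma_{xz} \\ A_1 & A_3 & A_2' \\ \Sigma_{zx} & A_2 & \Sigma_{zz} \end{bmatrix}, \qquad (3.1.7)$$

where

$$A_1 = \Sigma_{yx.z} \Sigma_{xx.z}^{-1} \Sigma_{xx} + \Sigma_{yz.x} \Sigma_{zz.x}^{-1} \Sigma_{zx},$$

$$A_2 = \Sigma_{zx} \Sigma_{xx.z}^{-1} \Sigma_{xy.z} + \Sigma_{zz} \Sigma_{zz.x}^{-1} \Sigma_{zy.x},$$

and

$$A_3 = \Sigma_{yx.z} \Sigma_{xx.z}^{-1} \Sigma_{xx} \Sigma_{xx.z}^{-1} \Sigma_{xy.z} + \Sigma_{yz.x} \Sigma_{zz.x}^{-1} \Sigma_{zz} \Sigma_{zz.x}^{-1} \Sigma_{zy.x} + 2\,\Sigma_{yx.z} \Sigma_{xx.z}^{-1} \Sigma_{xz} \Sigma_{zz.x}^{-1} \Sigma_{zy.x}.$$

Also, the vectors $(\hat{X}_j, Y_j, Z_j)$, $j = n_1+1, \ldots, n_1+n_2$, have a common multivariate normal distribution with mean vector $(\mu_x, \mu_y, \mu_z)$ and variance-covariance matrix

$$\Omega_2 = \begin{bmatrix} A_4 & A_5' & A_6' \\ A_5 & \Sigma_{yy} & \Sigma_{yz} \\ A_6 & \Sigma_{zy} & \Sigma_{zz} \end{bmatrix}, \qquad (3.1.8)$$

where

$$A_4 = \Sigma_{xy.z} \Sigma_{yy.z}^{-1} \Sigma_{yy} \Sigma_{yy.z}^{-1} \Sigma_{yx.z} + \Sigma_{xz.y} \Sigma_{zz.y}^{-1} \Sigma_{zz} \Sigma_{zz.y}^{-1} \Sigma_{zx.y} + 2\,\Sigma_{xy.z} \Sigma_{yy.z}^{-1} \Sigma_{yz} \Sigma_{zz.y}^{-1} \Sigma_{zx.y},$$

$$A_5 = \Sigma_{yy} \Sigma_{yy.z}^{-1} \Sigma_{yx.z} + \Sigma_{yz} \Sigma_{zz.y}^{-1} \Sigma_{zx.y},$$

and

$$A_6 = \Sigma_{zy} \Sigma_{yy.z}^{-1} \Sigma_{yx.z} + \Sigma_{zz} \Sigma_{zz.y}^{-1} \Sigma_{zx.y}.$$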

Note that the distributions given by (3.1.7) and (3.1.8) are singular because the predicted values $\hat{Y}_i$ and $\hat{X}_{j+n_1}$ are linear functions of the other components of the random vectors $T_i = (X_i, \hat{Y}_i, Z_i)$ and $U_j = (\hat{X}_{j+n_1}, Y_{j+n_1}, Z_{j+n_1})$ respectively, where $i = 1, 2, \ldots, n_1$ and $j = 1, 2, \ldots, n_2$. In order to describe Kadane's procedures to match the completed File 1, namely, $T_1, \ldots, T_{n_1}$, with the completed File 2, namely, $U_1, \ldots, U_{n_2}$, let us first assume, for simplicity, that $n_1 = n_2 = n$. Starting with n records in each file, we will compute the differences

$$T_i - U_j = \begin{bmatrix} X_i - \hat{X}_{j+n} \\ \hat{Y}_i - Y_{j+n} \\ Z_i - Z_{j+n} \end{bmatrix}, \quad 1 \le i, j \le n, \qquad (3.1.9)$$


in order to define a measure of dissimilarity between any pair of records, one each from the two completed files. Suppose first that there exist a vector of constants $\ell = (\ell_1, \ldots, \ell_k)'$, say, and i and j such that

$$P(\ell'(T_i - U_j) = 0) = 1. \qquad (3.1.10)$$

In view of the independence of the random vectors $T_i$ and $U_j$, it is clear that (3.1.10) cannot hold. Consequently, any of the vectors $T_i - U_j$ is free of any linear relationship among its components. It follows from this fact and (3.1.7) to (3.1.9) that the differences $T_i - U_j$, $1 \le i, j \le n$, are identically distributed, each with a nonsingular multivariate normal distribution with mean 0 and variance-covariance matrix $\Omega_1 + \Omega_2$. For any positive definite matrix A, a dissimilarity measure between $T_i$ and $U_j$ can be defined by the quadratic form

$$d_{ij}(A) = (T_i - U_j)' A (T_i - U_j). \qquad (3.1.11)$$

Also, $d_{ij}(A)$ will be referred to as the distance between the ith record of File 1 and the jth record of File 2. Various choices of A in (3.1.11) provide different distance measures.

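It may be recalled from Section 1.3 that a constrained matching of the two files is obtained by minimizing

$$C = \sum_{i=1}^{n} \sum_{j=1}^{n} d_{ij}\, a_{ij} \qquad (3.1.12)$$

subject to the conditions

$$\sum_{j=1}^{n} a_{ij} = 1, \quad \forall\, i = 1, 2, \ldots, n, \qquad (3.1.13)$$

$$\sum_{i=1}^{n} a_{ij} = 1, \quad \forall\, j = 1, 2, \ldots, n, \qquad (3.1.14)$$

and

$$a_{ij} = 0 \text{ or } 1, \quad \forall\, i \text{ and } j. \qquad (3.1.15)$$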
If the $d_{ij}$'s in (3.1.12) are given by the $d_{ij}(A)$'s in (3.1.11) for some choice of A, then we obtain an optimal distance-based constrained match. Note that this type of matching of the two files amounts to solving a linear assignment problem. Sometimes, an optimal matching may be obtained by minimizing (3.1.12) without requiring that the conditions (3.1.13) and (3.1.14) hold. However, as reported in Rodgers (1984), unconstrained optimal matches do not provide good estimates of the distribution of W = (X,Y,Z). We shall not discuss such "unconstrained matchings."
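
For very small files the constrained match can be found by direct enumeration, which makes the structure of the assignment problem in (3.1.12) to (3.1.15) concrete. The sketch below is illustrative only: the names `quad_form` and `constrained_match` are hypothetical, the records are made up, and exhaustive search over permutations is a stand-in for a proper linear assignment algorithm, which would be needed for files of realistic size.

```python
from itertools import permutations


def quad_form(diff, A):
    """Quadratic form diff' A diff for a list `diff` and a square matrix `A`."""
    k = len(diff)
    return sum(diff[r] * A[r][c] * diff[c] for r in range(k) for c in range(k))


def constrained_match(T, U, A):
    """Minimize the sum of d_ij(A) over one-to-one matches by enumeration.

    T and U are equal-length lists of completed records (tuples of numbers).
    Returns (phi, cost), where record i of File 1 is matched with record
    phi[i] of File 2. Feasible only for small n, since n! matches are tried.
    """
    n = len(T)
    d = [[quad_form([t - u for t, u in zip(T[i], U[j])], A) for j in range(n)]
         for i in range(n)]
    best = min(permutations(range(n)),
               key=lambda p: sum(d[i][p[i]] for i in range(n)))
    return list(best), sum(d[i][best[i]] for i in range(n))


if __name__ == "__main__":
    # Hypothetical completed records (X, Y, Z) with scalar components.
    T = [(0.0, 0.0, 0.1), (1.0, 1.0, 2.0), (2.0, 2.0, 1.0)]
    U = [(0.0, 0.0, 1.1), (0.0, 0.0, 0.0), (5.0, 5.0, 2.2)]
    A = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0], [0.0, 0.0, 1.0]]  # weight on Z only
    print(constrained_match(T, U, A))
```

Each permutation corresponds to one 0-1 matrix $(a_{ij})$ satisfying the row and column constraints, so the enumeration covers exactly the feasible set of the constrained problem.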

It is important to note that the aforementioned optimization problem needs to be solved for each realization of the random variables involved. Suppose then that $T_i$ and $U_j$ have been matched in a given problem. Then it might be natural to take $(X_i, Y_j, Z_i)$ and $(X_i, Y_j, Z_j)$ as simulations of the underlying distribution. Now, the parameter $\gamma$ in (3.1.4) can be estimated using one of the following synthetic samples:

Synthetic File 1: $(X_i, Y_i^*, Z_i)$, $i = 1, 2, \ldots, n$. (3.1.16)

Synthetic File 2: $(X_j^*, Y_j, Z_j)$, $j = n+1, \ldots, 2n$. (3.1.17)

where $Y_i^*$ and $X_j^*$ are values given by the matching procedure.

Kadane has suggested that matchings based on a fixed A in (3.1.11) and the consequent inferences based on synthetic files such as (3.1.16) or (3.1.17) must be repeated many times and the results must be averaged in some sensible way in order to explore the sensitivity of our findings to the value of $\Sigma_{xy}$ we started with. We shall not pursue such issues as the actual choice of a prior on $\Sigma$ and the aforementioned sensitivity studies of inferences based on synthetic data. However, we shall now discuss Kadane's choices of the matrix A, which will be used in our Monte-Carlo Study of Section 3.3.

Kadane has advocated two choices for the matrix A in the definition of the distance measure $d_{ij}$, which is given by (3.1.11):

(i) $A = (\Omega_1 + \Omega_2)^{-1}$, (3.1.18)

where $\Omega_1$ and $\Omega_2$ are the matrices in (3.1.7) and (3.1.8); this A leads to the so-called Mahalanobis distance between the records of the two files, and

(ii) $A = \begin{bmatrix} 0 & 0 \\ 0 & \Sigma_{zz}^{-1} \end{bmatrix}$. (3.1.19)


In general, the relative benefits of these two distance measures are an open question, although the empirical studies of Barr et al. (1982) and other investigators reported in Rodgers (1984) indicate that the Mahalanobis distance is worse than the distance provided by (3.1.19) in the sense of distorting the bivariate and multivariate relationships among the variables X, Y and Z. In view of this, we shall follow Kadane (1978) in calling the measure induced by (3.1.19) the "bias-avoiding distance function." The special case of (3.1.19) when Z has only one component will be discussed in the next subsection.


3.1.1 Isotonic Matching Strategy


We shall evaluate, in Section 3.3, Kadane's matching strategies in the simple case when the triple W = (X,Y,Z) has a trivariate normal distribution. In order to facilitate such evaluations, we now show that, in the special case of a scalar Z, the matching strategy based on (3.1.19) can be implemented without using any algorithm to minimize distances.

Assuming that Z is scalar and using (3.1.19) in the objective function given by (3.1.12), C is equivalent to

$$C = \sum_{i=1}^{n} \sum_{j=1}^{n} (Z_{1i} - Z_{2j})^2\, a_{ij}. \qquad (3.1.20)$$
In a constrained match, the $a_{ij}$'s are subject to the conditions (3.1.13) to (3.1.15). Thus, (3.1.20) further simplifies to

$$\sum_{i=1}^{n} Z_{1i}^2 + \sum_{j=1}^{n} Z_{2j}^2 - 2 \sum_{i=1}^{n} \sum_{j=1}^{n} Z_{1i} Z_{2j}\, a_{ij}.$$

Hence, the minimization of distances reduces to maximizing

$$C^* = \sum_{i=1}^{n} \sum_{j=1}^{n} a_{ij}\, Z_{1i} Z_{2j} \qquad (3.1.21)$$

subject to the conditions (3.1.13) to (3.1.15) on the $a_{ij}$'s.

DeGroot and Goel (1976) show that, given the numbers $z_{1i}$'s and $z_{2j}$'s, the constrained maximization of $C^*$ is equivalent to maximizing $\sum_{i=1}^{n} z_{1i}\, z_{2\varphi(i)}$ over all permutations $\varphi$ of the integers 1, 2, ..., n. However, this latter extremal problem was encountered in Section 2.4 when we derived the M.L.P. $\varphi^*$ for certain bivariate matching problems. It follows that, with regard to Kadane's distance measure given by (3.1.19), where Z is scalar, the optimal matching strategy is to order the Z-values in the two files separately and then match the ith largest Z in File 1 with the ith largest Z in File 2. This explicit solution means that, if Kadane's matrix in equation (3.1.19) is used to minimize distances between records of the two files, then the synthetic File 1 is obtained by matching the X-concomitant of the ith order statistic among the Z's in File 1 with the Y-concomitant of the ith order statistic among the Z's in File 2.

We shall refer to this strategy as isotonic matching of the two files because the matching procedure is determined by the order statistics of the Z's in File 1 and the order statistics of the Z's in File 2.
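
The sorting shortcut can be sketched in a few lines and compared, on a toy example, with exhaustive maximization of $\sum_i z_{1i} z_{2\varphi(i)}$. The function names `isotonic_match` and `brute_force_match` and the sample values are hypothetical; ties among the Z-values are not treated specially here.

```python
from itertools import permutations


def isotonic_match(z1, z2):
    """Match the i-th smallest value of z1 with the i-th smallest value of z2.

    Returns phi, where record i of File 1 is matched with record phi[i]
    of File 2. Both lists must have the same length.
    """
    order1 = sorted(range(len(z1)), key=lambda i: z1[i])
    order2 = sorted(range(len(z2)), key=lambda j: z2[j])
    phi = [0] * len(z1)
    for i, j in zip(order1, order2):
        phi[i] = j
    return phi


def brute_force_match(z1, z2):
    """Maximize sum_i z1[i] * z2[phi[i]] over all permutations (small n only)."""
    n = len(z1)
    best = max(permutations(range(n)),
               key=lambda p: sum(z1[i] * z2[p[i]] for i in range(n)))
    return list(best)


if __name__ == "__main__":
    z1 = [0.1, 2.0, 1.0]   # hypothetical Z-values in File 1
    z2 = [1.1, 0.0, 2.2]   # hypothetical Z-values in File 2
    print(isotonic_match(z1, z2), brute_force_match(z1, z2))
```

Sorting costs on the order of n log n operations, whereas a general assignment algorithm is considerably more expensive, which is the practical appeal of the isotonic strategy for scalar Z.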

3.1.2 Sims' Matching Strategy

In the preceding subsection, it was shown that one of Kadane's matching strategies can be simplified to the point of not using any optimization algorithm in the matching procedure. Such simplification is clearly not possible when the triple (X,Y,Z) has a multidimensional Z. The whole idea of generating very large synthetic data sets by actually minimizing a sum of distances over all potential matches seems computationally profligate. One possible alternative to distance-based strategies, which was suggested by Sims (1978), will now be outlined.


Sims has stressed the importance of exploiting the local sparseness or denseness of the sample data on the matching variables Z. A dense region of the Z-space is one within which we expect that the distributions of X and Y given Z change little. It is, at the same time, a region within which we have many observations. Sims has suggested that, within a dense region, any arbitrary matching procedure will produce results that do not distort the joint distribution of X, Y and Z. Regions which are not dense have few observations and, within them, statistical matching becomes difficult. Sims felt that in a sparse region, statistical matchings will almost certainly distort the joint distribution of X, Y and Z. He suggested that, in such a region, we should either not match at all or go beyond matching to more elaborate methods of generating synthetic data. However, Sims did not spell out any specific alternative to matching within sparse Z-regions.

In our Monte-Carlo Study for comparing Kadane's strategies with Sims', which will be presented in Section 3.3, we created ten bins in the Z-space, namely (-∞,-1.00], (-1.00,-0.75], (-0.75,-0.50], (-0.50,-0.25], (-0.25,0.00], (0.00,0.25], (0.25,0.50], (0.50,0.75], (0.75,1.00], (1.00,+∞). The conditional mean of X or Y, given Z, did not change much inside the eight bins which were between -1.00 and 1.00. Hence, these latter bins were considered dense bins and the two bins in the left and right tail of the distribution of Z were considered sparse bins. Within each dense bin, we randomly matched records of the two files, whereas the isotonic matching strategy of Subsection 3.1.1 was used in the sparse bins.
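
The ten bins described above can be encoded with a sorted list of cut points. The sketch below covers only the binning step; the names `EDGES`, `bin_index`, `is_dense` and `group_by_bin` are hypothetical, and how records are paired within a bin when the two files contribute different numbers of records is not specified in the text and is left out.

```python
from bisect import bisect_left

# Cut points of the ten right-closed bins (-inf, -1.00], (-1.00, -0.75], ...,
# (0.75, 1.00], (1.00, +inf).
EDGES = [-1.00, -0.75, -0.50, -0.25, 0.00, 0.25, 0.50, 0.75, 1.00]


def bin_index(z):
    """Index 0..9 of the bin containing z (bins are closed on the right)."""
    return bisect_left(EDGES, z)


def is_dense(k):
    """The eight interior bins are treated as dense, the two tails as sparse."""
    return 1 <= k <= 8


def group_by_bin(zs):
    """Map each bin index to the list of record positions falling in it."""
    groups = {}
    for pos, z in enumerate(zs):
        groups.setdefault(bin_index(z), []).append(pos)
    return groups


if __name__ == "__main__":
    zs = [-1.7, -1.0, -0.9, 0.0, 0.1, 1.0, 1.3]   # hypothetical Z-values
    for k, members in sorted(group_by_bin(zs).items()):
        print(k, "dense" if is_dense(k) else "sparse", members)
```

Using `bisect_left` places a value equal to a cut point in the bin to its left, which matches the right-closed intervals listed in the text.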

3.2  Alternatives  to  Statistical  Matching 
Under  Conditional  Independence 

Several criticisms of the matching methodology were mentioned in Section 1.6. It was observed that the formation of packets on the basis of matching variables Z and the merging of records within each packet imply that the non-matching variables X and Y are conditionally independent given the values of Z. Following A. P. Dawid (1979), we shall use the notation $X \perp\!\!\!\perp Y \mid Z$ to denote the conditional independence of X and Y given Z.

Consider the situation in which we match the fragmentary data provided by the files in (3.1.1). It may be recalled from Section 1.2 that any statistical model for this type of matching should imply that the data in File 1 is stochastically independent of the data in File 2. Clearly, such files of data cannot be used to statistically test the validity of the implicit assumption that $X \perp\!\!\!\perp Y \mid Z$. Furthermore, Sims (1978) has observed that matching itself for the purpose of, among others, estimating $\gamma$ in (3.1.4) is unnecessary. He pointed out that, when $X \perp\!\!\!\perp Y \mid Z$ holds, one can write
$$dF(w) = dF^{XZ}(w)\, dF^{YZ}(w) / dF^{Z}(w), \qquad (3.2.1)$$

where $F^{XZ}(\cdot)$ is the marginal (with regard to W) CDF of X and Z and the other terms on the right-hand side of (3.2.1) are analogously defined marginal distribution functions. The two separate samples in (3.1.1) are adequate to estimate all the terms on the right-hand side of (3.2.1) by any of a number of statistical methods. In this section, we will discuss some alternatives to matching. With emphasis on estimating the covariances or correlations between X and Y, we shall first review a histogram-type alternative which was suggested by Sims (1978).


Suppose that we form a grid in the W space and estimate the joint density of W by first counting the number of sample points in each cell of the grid. Let i index X-categories, j index Y-categories and k index Z-categories. Let $n_{ijk}$ be the number of sample points in the (i,j,k)th cell and use the dot notation to define counts of sample points with regard to marginal distributions. Thus, we have

$n_{i \cdot k}$ = the number of sample points with X in the ith category and Z in the kth category,

$n_{\cdot jk}$ = the number of sample points with Y in the jth category and Z in the kth category,

and

$n_{\cdot\cdot k}$ = the number of sample points with Z in the kth category.
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Clearly, 
n 


.  .k 


S  ni.k  =  jj  n.Jk 


and  the  data  in  the  tv«o  files  given  by  (3.1.1)  can  be  used  to  compute 
n  ,  n  and  n  for  all  possible  values  of  i,  J  and  k.  Thus, 

-  •  K  *jiC  »  •« 

k  is  obtained  from  File  1,  n  ^  from  File  ?  and  n  from  the  two 
files  together.  Finally,  for  a  known  function,  g( . ) ,  say,  let  g(w^jk) 
denote  the  value  of  g  computed  at  the  center,  w^^  of  the  (i.j.kf*1 
cell  of  the  grid  that  we  started  with.  Sims  has  suggestad  that  we 
could  estimate  y  In  (3.1.4)  by  the  statistic 


;  -  l  8<«l1k>  nink  l,'3k  (3.2-3) 

l.J.k  J  ..k 

With  regard  to  y  in  (3.2.2),  theoretical  properties  such  as  the 
asymptotic  distribution  of  y  (as  the  sample  3lze  tends  to  ®)  are 
unknown  at  the  present  time.  Also,  practical  problems  such  as  the 
choice  of  W-grld  and  the  cells  thereof,  which  would  keep  the  number 
of  terms  in  the  sum  (3.2.2)  computationally  reasonable,  have  not  been 
studied  yet. 
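
The histogram-type idea can be illustrated with a minimal Python sketch
for scalar X, Y and Z. Everything specific here is an assumption made
for illustration and is not taken from Sims (1978): the function names,
the equal-width grid on [-4, 4), the simulated data, and in particular
the normalisation, which turns the counts into within-file relative
frequencies so that the result estimates an expectation E[g(W)].

```python
import math
import random
from collections import Counter

LO, WIDTH, N_BINS = -4.0, 0.5, 16  # hypothetical grid: 16 cells of width 0.5


def cell(value):
    """Index of the grid category containing value, clamped to the grid."""
    return min(max(int((value - LO) // WIDTH), 0), N_BINS - 1)


def center(index):
    """Midpoint of a grid category."""
    return LO + (index + 0.5) * WIDTH


def histogram_estimate(file1, file2, g):
    """Estimate E[g(X, Y, Z)] from File 1 = [(x, z)] and File 2 = [(y, z)].

    Within each Z-category k the cell probability is taken to be
    P(i | k) * P(j | k) * P(k); this is where conditional independence enters.
    Z-categories that occur in only one file contribute nothing.
    """
    n_ik = Counter((cell(x), cell(z)) for x, z in file1)  # counts from File 1
    n_jk = Counter((cell(y), cell(z)) for y, z in file2)  # counts from File 2
    n1_k = Counter(cell(z) for _, z in file1)
    n2_k = Counter(cell(z) for _, z in file2)
    n_total = len(file1) + len(file2)
    estimate = 0.0
    for (i, k), c1 in n_ik.items():
        for (j, k2), c2 in n_jk.items():
            if k2 != k:
                continue
            p_k = (n1_k[k] + n2_k[k]) / n_total
            p_cell = (c1 / n1_k[k]) * (c2 / n2_k[k]) * p_k
            estimate += g(center(i), center(j), center(k)) * p_cell
    return estimate


def draw(rng, rho_xz, rho_yz):
    """Simulated (X, Y, Z) with X and Y conditionally independent given Z."""
    z = rng.gauss(0.0, 1.0)
    x = rho_xz * z + math.sqrt(1 - rho_xz ** 2) * rng.gauss(0.0, 1.0)
    y = rho_yz * z + math.sqrt(1 - rho_yz ** 2) * rng.gauss(0.0, 1.0)
    return x, y, z


rng = random.Random(0)
file1 = [(x, z) for x, _, z in (draw(rng, 0.9, 0.8) for _ in range(5000))]
file2 = [(y, z) for _, y, z in (draw(rng, 0.9, 0.8) for _ in range(5000))]
gamma_hat = histogram_estimate(file1, file2, lambda x, y, z: x * y)
print(round(gamma_hat, 3))
```

With g(x, y, z) = xy and zero means, the target is the covariance of X
and Y, which equals 0.9 x 0.8 = 0.72 in the simulated population. The
grid introduces some discretisation error, so the printed value should
only be expected to lie near that figure.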

Sims (1978) stated that a procedure like the one leading to $\hat{\gamma}$
in (3.2.2), which takes into account the implicit assumption of
conditional independence of the matching methodology, had the following
advantages over matching to create a synthetic file such as (3.1.16):

(a) the procedure lends itself to computation of standard errors
indicating the reliability of computations based on it,

(b) the procedure can be connected to the large statistical literature
on estimating density functions and multi-dimensional
contingency tables, and

(c) it is likely to provide more accurate results than matching.

Given the lack of work on the statistical properties of the alternatives
to matching, we can agree with the advantages (a) and (b), but
regard (c) as an undemonstrated speculation. We shall not discuss
$\hat{\gamma}$ in (3.2.2) any further. Nor shall we elaborate the merits and
demerits of alternatives to matching and synthetic-data-based
procedures. Nevertheless, in the next subsection, we shall derive the
estimators of parameters for conditionally independent normal models
without matching the files in (3.1.1).

3.2.1  Maximum  Likelihood  Estimation  in  Multivariate  Normal  Models 
Using  Two  Files  of  Data 

Consider the random vectors X, Y and Z, with respective dimensions
$p_1$, $p_2$ and $p_3$. Suppose that $W = (X, Y, Z)$ has a nonsingular
multivariate normal distribution with unknown mean vector
$\mu = (\mu_x, \mu_y, \mu_z)$ and unknown variance-covariance matrix $\Sigma$, which is
partitioned as in (3.1.3). Suppose that the sample data in (3.1.1)
are available and that $n_1 > p_1 + p_3$ and $n_2 > p_2 + p_3$. Note that, in view
of the nonsingularity of the distribution of W and the fact that
$Z_1, \ldots, Z_{n_1+n_2}$ are stochastically independent, the ranks of the
matrices $(Z_1, \ldots, Z_{n_1})$ and $(Z_{n_1+1}, \ldots, Z_{n_1+n_2})$ are equal to $p_3$ for
almost every realization of the Z's.



In this section, we shall find the maximum likelihood estimator
of, among others, the covariances among the variables in the vectors
X and Y, without matching the files (3.1.1) but assuming that
$X \perp\!\!\!\perp Y \mid Z$. The maximum likelihood estimation of parameters in
multivariate normal models based on various patterns of missing data
has been discussed in the literature. See, for example, Eaton and
Kariya (1983), Kariya et al. (1983), Anderson (1984) and Srivastava
and Khatri (1979). However, the pattern of data given by the set-up
(3.1.1) does not seem to have been examined. Note first that, under
conditional independence, the density of W can be written as

$$ f_W(w; \theta) = f_1(z)\, f_2(x \mid z)\, f_3(y \mid z), \tag{3.2.3} $$

where

$$ \theta = (\mu_x, \mu_y, \mu_z, \Sigma_{xx}, \Sigma_{yy}, \Sigma_{zz}, \Sigma_{xz}, \Sigma_{yz}) \tag{3.2.4} $$

and $f_W(w)$ is the joint density of W given by

$$ f_W(w) = (2\pi)^{-(p_1+p_2+p_3)/2}\, |\Sigma|^{-1/2}\, \operatorname{etr}\left[ -\tfrac{1}{2}\, \Sigma^{-1} (w - \mu)(w - \mu)' \right], \tag{3.2.5} $$

etr being the exponential of the trace of a matrix. Also, $f_1(\cdot)$ is
the marginal density function of Z, and $f_2(\cdot)$ and $f_3(\cdot)$ are respectively
the conditional densities of X and Y, given Z = z. It is well known
(Anderson, 1984, p. 33 and 37) that $f_1$, $f_2$ and $f_3$ also correspond to
certain multivariate normal densities like (3.2.5). Using the joint
normality of X, Y and Z, it is easy to verify that (3.2.3) holds iff

$$ \Sigma_{xy} = \Sigma_{xz}\, \Sigma_{zz}^{-1}\, \Sigma_{zy}. \tag{3.2.6} $$


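It follows from (3.2.3) that the likelihood of the observed
data in the two files given by (3.1.1) is

$$ L(\theta) = L_1(\theta)\, L_2(\theta)\, L_3(\theta), \tag{3.2.7} $$

where

$$ L_1(\theta) = \prod_{i=1}^{n_1+n_2} f_1(z_i; \theta), \tag{3.2.8} $$

$$ L_2(\theta) = \prod_{i=1}^{n_1} f_2(x_i \mid z_i; \theta) \tag{3.2.9} $$

and

$$ L_3(\theta) = \prod_{i=n_1+1}^{n_1+n_2} f_3(y_i \mid z_i; \theta). \tag{3.2.10} $$

Taking natural logarithms of both sides of the equation (3.2.7), we
obtain

$$ l(\theta) = \sum_{a=1}^{3} l_a(\theta), \tag{3.2.11} $$

where $l_a(\theta) = \log(L_a(\theta))$, $a = 1, 2, 3$.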

Let $\bar{z}$ and $S_z$ denote respectively the mean and the matrix of
corrected sums of squares and products of the data $z_1, \ldots, z_{n_1+n_2}$.
That is,

$$ \bar{z} = \frac{1}{n_1+n_2} \sum_{i=1}^{n_1+n_2} z_i, \qquad
   S_z = \sum_{i=1}^{n_1+n_2} (z_i - \bar{z})(z_i - \bar{z})'. \tag{3.2.12} $$

Similarly, let $\bar{z}_1$ ($\bar{z}_2$) and $S_1$ ($S_2$) be the mean and the matrix of
corrected sums of squares and products of the data $z_1, \ldots, z_{n_1}$
($z_{n_1+1}, \ldots, z_{n_1+n_2}$). Let, for any lower-case a, b and c, and any
vector z,

$$ \mu_{a.b}(z) = \mu_a + \Sigma_{ab}\, \Sigma_{bb}^{-1} (z - \mu_b), \qquad
   \Sigma_{ab.c} = \Sigma_{ab} - \Sigma_{ac}\, \Sigma_{cc}^{-1}\, \Sigma_{cb}. \tag{3.2.13} $$

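Then using the notations in (3.2.12) and (3.2.13), the equations
(3.2.5), (3.2.7) to (3.2.10) and Theorem 2.5.1 of Anderson (1984)
(for the expressions defining $f_2$ and $f_3$), we obtain

$$ l_1(\theta) = -\frac{n_1+n_2}{2} \log|\Sigma_{zz}|
   + \operatorname{tr}\left\{ -\tfrac{1}{2}\, \Sigma_{zz}^{-1}
     \left[ S_z + (n_1+n_2)(\bar{z} - \mu_z)(\bar{z} - \mu_z)' \right] \right\}, \tag{3.2.14} $$

$$ l_2(\theta) = -\frac{n_1}{2} \log|\Sigma_{xx.z}|
   + \operatorname{tr}\left\{ -\tfrac{1}{2}\, \Sigma_{xx.z}^{-1}
     \sum_{i=1}^{n_1} (x_i - \mu_{x.z}(z_i))(x_i - \mu_{x.z}(z_i))' \right\} \tag{3.2.15} $$

and

$$ l_3(\theta) = -\frac{n_2}{2} \log|\Sigma_{yy.z}|
   + \operatorname{tr}\left\{ -\tfrac{1}{2}\, \Sigma_{yy.z}^{-1}
     \sum_{j=n_1+1}^{n_1+n_2} (y_j - \mu_{y.z}(z_j))(y_j - \mu_{y.z}(z_j))' \right\}. \tag{3.2.16} $$

Note that in (3.2.14) to (3.2.16), certain constant terms have been
omitted.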

It is clear from (3.2.7) and (3.2.11) that the M.L.E. of $\theta$ is
obtained by maximizing $l_a(\theta)$ over $\theta$ for each a = 1,2,3 separately.
Moreover, this maximization is easier if we reparametrize the
distribution of W by means of

$$ \eta = (\mu_z, \Sigma_{zz}, \nu_{xz}, \nu_{yz}, \Sigma_{xx.z}, \Sigma_{yy.z}, B_{xz}, B_{yz}), \tag{3.2.17} $$

where, apart from the notations that we have already introduced, we
have, for any letters a and b,

$$ B_{ab} = \Sigma_{ab}\, \Sigma_{bb}^{-1}, \qquad \nu_{ab} = \mu_a - B_{ab}\, \mu_b. \tag{3.2.18} $$

It can be easily shown that there is a one-to-one correspondence
between $\theta$ and $\eta$. Consequently, if we rewrite the $l_a(\theta)$'s in terms of $\eta$,
then maximizing $L(\theta)$ over $\theta$ is equivalent to maximizing $l_a(\eta)$ over $\eta$,
for each a = 1,2,3. The advantage of the transformation to the
$\eta$-space is that the $l_a(\eta)$'s are functions of disjoint portions of $\eta$.
In fact, $l_1(\eta)$ is the same as $l_1(\theta)$, whereas it follows from (3.2.15)
to (3.2.18) that


$$ l_2(\eta) = -\frac{n_1}{2} \log|\Sigma_{xx.z}|
   + \operatorname{tr}\left\{ -\tfrac{1}{2}\, \Sigma_{xx.z}^{-1}
     \left[ \sum_{i=1}^{n_1} (x_i - \nu_{xz} - B_{xz} z_i)(x_i - \nu_{xz} - B_{xz} z_i)' \right] \right\} \tag{3.2.19} $$

and

$$ l_3(\eta) = -\frac{n_2}{2} \log|\Sigma_{yy.z}|
   + \operatorname{tr}\left\{ -\tfrac{1}{2}\, \Sigma_{yy.z}^{-1}
     \left[ \sum_{j=n_1+1}^{n_1+n_2} (y_j - \nu_{yz} - B_{yz} z_j)(y_j - \nu_{yz} - B_{yz} z_j)' \right] \right\}. \tag{3.2.20} $$

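In view of Theorem 8.2.1 of Anderson (1984), it can be easily
shown using (3.2.14), (3.2.19) and (3.2.20) that the M.L.E. of $\eta$ is
given by

$$ \begin{aligned}
\hat{\mu}_z &= \bar{z}, \qquad \hat{\Sigma}_{zz} = \frac{S_z}{n_1+n_2}, \\
\hat{B}_{xz} &= \left[ \sum_{i=1}^{n_1} (x_i - \bar{x})(z_i - \bar{z}_1)' \right] S_1^{-1}, \\
\hat{B}_{yz} &= \left[ \sum_{j=n_1+1}^{n_1+n_2} (y_j - \bar{y})(z_j - \bar{z}_2)' \right] S_2^{-1}, \\
\hat{\nu}_{xz} &= \bar{x} - \hat{B}_{xz}\, \bar{z}_1, \qquad \hat{\nu}_{yz} = \bar{y} - \hat{B}_{yz}\, \bar{z}_2, \\
\hat{\Sigma}_{xx.z} &= \frac{1}{n_1} \sum_{i=1}^{n_1} (x_i - \hat{\nu}_{xz} - \hat{B}_{xz} z_i)(x_i - \hat{\nu}_{xz} - \hat{B}_{xz} z_i)', \\
\hat{\Sigma}_{yy.z} &= \frac{1}{n_2} \sum_{j=n_1+1}^{n_1+n_2} (y_j - \hat{\nu}_{yz} - \hat{B}_{yz} z_j)(y_j - \hat{\nu}_{yz} - \hat{B}_{yz} z_j)'.
\end{aligned} \tag{3.2.21} $$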

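Using these estimators and the relationships between $\theta$ and $\eta$, we
obtain the M.L.E. of $\theta$ by means of the following equations:

$$ \begin{aligned}
\hat{\mu}_x &= \hat{\nu}_{xz} + \hat{B}_{xz}\, \hat{\mu}_z, \\
\hat{\mu}_y &= \hat{\nu}_{yz} + \hat{B}_{yz}\, \hat{\mu}_z, \\
\hat{\Sigma}_{xx} &= \hat{B}_{xz}\, \hat{\Sigma}_{zz}\, \hat{B}_{xz}' + \hat{\Sigma}_{xx.z}, \\
\hat{\Sigma}_{xz} &= \hat{B}_{xz}\, \hat{\Sigma}_{zz}, \\
\hat{\Sigma}_{yy} &= \hat{B}_{yz}\, \hat{\Sigma}_{zz}\, \hat{B}_{yz}' + \hat{\Sigma}_{yy.z}, \\
\hat{\Sigma}_{yz} &= \hat{B}_{yz}\, \hat{\Sigma}_{zz}, \\
\text{and} \quad \hat{\Sigma}_{xy} &= \hat{\Sigma}_{xz}\, \hat{\Sigma}_{zz}^{-1}\, \hat{\Sigma}_{zy}.
\end{aligned} \tag{3.2.22} $$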
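
For scalar X, Y and Z the no-matching estimator of the covariance of X
and Y reduces to a few sums. The following Python sketch is an
illustration under that simplifying assumption ($p_1 = p_2 = p_3 = 1$); the
function names and the simulated data are hypothetical and are not part
of the original study.

```python
import random


def mean(values):
    return sum(values) / len(values)


def two_file_cov_xy(file1, file2):
    """Scalar two-file estimate of Cov(X, Y) under conditional independence.

    file1 holds (x, z) records and file2 holds (y, z) records; the files
    are never matched.
    """
    x = [r[0] for r in file1]
    z1 = [r[1] for r in file1]
    y = [r[0] for r in file2]
    z2 = [r[1] for r in file2]
    xbar, ybar, zbar1, zbar2 = mean(x), mean(y), mean(z1), mean(z2)
    # Regression slopes of X on Z (File 1 only) and of Y on Z (File 2 only)
    b_xz = (sum((a - xbar) * (c - zbar1) for a, c in zip(x, z1))
            / sum((c - zbar1) ** 2 for c in z1))
    b_yz = (sum((a - ybar) * (c - zbar2) for a, c in zip(y, z2))
            / sum((c - zbar2) ** 2 for c in z2))
    # Variance of Z from the pooled Z-values of both files
    z_all = z1 + z2
    zbar = mean(z_all)
    sigma_zz = sum((c - zbar) ** 2 for c in z_all) / len(z_all)
    sigma_xz = b_xz * sigma_zz
    sigma_yz = b_yz * sigma_zz
    return sigma_xz * sigma_yz / sigma_zz


# Hypothetical population: X = 2Z + noise, Y = -Z + noise, Var(Z) = 2.25,
# so that Cov(X, Y) = 2 * (-1) * 2.25 = -4.5 and X, Y are independent given Z.
rng = random.Random(0)
file1 = []
for _ in range(20000):
    z = rng.gauss(0.0, 1.5)
    file1.append((2.0 * z + rng.gauss(0.0, 1.0), z))
file2 = []
for _ in range(20000):
    z = rng.gauss(0.0, 1.5)
    file2.append((-z + rng.gauss(0.0, 1.0), z))
cov_xy_hat = two_file_cov_xy(file1, file2)
print(round(cov_xy_hat, 2))
```

The estimate uses File 1 only for the slope of X on Z, File 2 only for
the slope of Y on Z, and both files for the variance of Z, mirroring the
way the three likelihood factors split the data.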


It follows from the above discussion that if we can justify the
assumption that $X \perp\!\!\!\perp Y \mid Z$, then we can avoid matching the files in
(3.1.1) and estimate, among other parameters, $\Sigma_{xy}$, by means of the
equations in (3.2.22). Unfortunately, the two data files contain no
information regarding the appropriateness of this assumption, and
prior information from other sources must be considered. The point
here is that, if the matching methodology is based on assumptions
like $X \perp\!\!\!\perp Y \mid Z$, then we must look for alternatives to matching whose
statistical properties are known. Such alternatives are useful
especially because very little is known about the reliability of
synthetic data files.
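
The role of the conditional independence assumption in the normal model
can be made explicit with a short derivation. This is a sketch based on
the standard decomposition of a covariance into its conditional parts;
$\Sigma_{xy.z}$ is the partial covariance defined as in (3.2.13).

$$ \begin{aligned}
\Sigma_{xy} &= E[\operatorname{Cov}(X, Y \mid Z)] + \operatorname{Cov}(E[X \mid Z],\, E[Y \mid Z]) \\
            &= \Sigma_{xy.z} + \Sigma_{xz}\, \Sigma_{zz}^{-1}\, \Sigma_{zz}\, \Sigma_{zz}^{-1}\, \Sigma_{zy} \\
            &= \Sigma_{xy.z} + \Sigma_{xz}\, \Sigma_{zz}^{-1}\, \Sigma_{zy},
\end{aligned} $$

using $E[X \mid Z] = \mu_x + \Sigma_{xz} \Sigma_{zz}^{-1} (Z - \mu_z)$, the analogous expression for Y,
and the fact that the conditional covariance is the constant $\Sigma_{xy.z}$
under joint normality. Conditional independence in the normal model is
the statement $\Sigma_{xy.z} = 0$, which leaves
$\Sigma_{xy} = \Sigma_{xz} \Sigma_{zz}^{-1} \Sigma_{zy}$. For scalar variables with unit variances
this reads $\rho_{xy} = \rho_{xz}\, \rho_{yz}$.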

It is important to note that (3.2.6) is a necessary condition
even if W is not normal, provided only that $X \perp\!\!\!\perp Y \mid Z$ holds and that
the appropriate moments of the distribution of W exist. Hence, we
can use the estimator $\hat{\Sigma}_{xy}$ in (3.2.22) even for non-normal
populations. We now show that $\hat{\Sigma}_{xy}$ is consistent for $\Sigma_{xy}$ without assuming
that W has a multivariate normal distribution.

Theorem 3.2.1 Suppose the joint distribution of W is such that its
second order moments exist and that the dispersion matrix, $\Sigma$, of W is
partitioned as in (3.1.3). If $X \perp\!\!\!\perp Y \mid Z$ then $\hat{\Sigma}_{xy}$, given by
(3.2.22), is strongly consistent for $\Sigma_{xy}$.

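Proof: We first note that $\hat{\Sigma}_{xz}$ and $\hat{\Sigma}_{zy}$ are stochastically independent
because they are functions of the independent data in File 1 and
File 2 respectively. However, $\hat{\Sigma}_{zz}$ involves $Z_i$'s in both files so
that the elements of the vector

$$ (\hat{\Sigma}_{xz},\, \hat{\Sigma}_{zz},\, \hat{\Sigma}_{zy}) \tag{3.2.23} $$

are dependent. The almost sure convergence of the vector in (3.2.23)
will follow from the almost sure convergence of $\hat{\Sigma}_{xz}$, $\hat{\Sigma}_{zz}$, $\hat{\Sigma}_{zy}$
individually (cf. Serfling, 1980, p. 52). In view of the similarities
of the proofs of the convergence of these matrices, we shall
only show that, as $n_a \to \infty$, a = 1,2,

$$ \hat{\Sigma}_{zz} \xrightarrow{\text{a.s.}} \Sigma_{zz}. \tag{3.2.24} $$

We obtain from (3.2.21),

$$ \hat{\Sigma}_{zz} = \frac{1}{n_1+n_2} \sum_{i=1}^{n_1+n_2} Z_i Z_i' - \bar{Z} \bar{Z}'. \tag{3.2.25} $$

Recalling our assumption that the files in (3.1.1) are independent
random samples and that the vector Z has a finite dispersion matrix,
it readily follows that the Strong Law of Large Numbers (cf.
Serfling, p. 27) applies to the independent sequences $\{Z_i Z_i'\}$ and $\{Z_i\}$.
Hence, we obtain, as $n_a \to \infty$,

$$ \frac{1}{n_1+n_2} \sum_{i=1}^{n_1+n_2} Z_i Z_i' \xrightarrow{\text{a.s.}} E(Z Z'), \tag{3.2.26} $$

$$ \bar{Z} \xrightarrow{\text{a.s.}} E(Z). \tag{3.2.27} $$

It follows from (3.2.25) to (3.2.27) that (3.2.24) holds. The same
argument applied to $\hat{\Sigma}_{xz}$ and $\hat{\Sigma}_{zy}$ gives

$$ (\hat{\Sigma}_{xz},\, \hat{\Sigma}_{zz},\, \hat{\Sigma}_{zy}) \xrightarrow{\text{a.s.}} (\Sigma_{xz},\, \Sigma_{zz},\, \Sigma_{zy}). \tag{3.2.28} $$

Finally, $\hat{\Sigma}_{xy} = \hat{\Sigma}_{xz}\, \hat{\Sigma}_{zz}^{-1}\, \hat{\Sigma}_{zy}$
is a continuous function of the random variables in the vector
(3.2.23). Hence, the strong consistency of $\hat{\Sigma}_{xy}$ follows from
(3.2.28). □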


Several distance-based matching strategies for creating
synthetic data have been discussed in Section 3.1. Specifically, two
strategies due to Kadane (1978) and a strategy which was proposed by
Sims (1978) were mentioned. In this section, we shall evaluate these
three strategies, individually as well as in relative terms, in the
special case where W = (X,Y,Z), the unobservable vector, has a
trivariate normal distribution. Before we discuss the Monte-Carlo study
of the aforementioned strategies, we shall review some of the earlier
simulation studies of statistical matching procedures, which have
certain bearing on our study. A more comprehensive review of evaluations
of statistical matching procedures can be found in Rodgers (1984).

Barr et al. (1982) used, among others, a statistical model in
which a vector $W = (X, Y, Z_1, Z_2)$ had a four-dimensional normal
distribution with zero means, unit variances and various levels of
covariances among the four variables. Altogether, these investigators
generated 100 pairs of independent files, namely File 1
comprising 200 observations on $(X, Z_1, Z_2)$ and File 2 consisting of 200
observations on $(Y, Z_1, Z_2)$, for each of 12 populations, where the
populations differed with respect to the covariances of the
variables. Then, for each such pair of files, six statistical
matches were performed, namely three constrained matches and three
unconstrained matches, one for each of three distance functions.
The first distance function was a weighted
sum of the absolute differences of the two Z variables between
records of the two files and the last two were the Mahalanobis
distance and the "bias-avoiding" distance, which were discussed in
Section 3.1. A summary of the findings of Barr et al. is as follows.

All three distance measures provided accurate estimates of the
variance of the Y variable when the constrained matching procedure
was used. They also found that all three unconstrained matching
procedures produced Y distributions that had means which were
significantly different from the corresponding population values.
The estimated covariances of Y with $Z_1$, $Z_2$, which were computed only
for constrained matches, tended to be underestimated. With respect
to the most important question in the context of merging files,
namely the estimation of relationships between X and Y variables, it
was reported that, if the conditional independence assumption was
invalid, all statistical matching procedures provided estimates of
the X-Y covariance that were extremely poor. On the other hand, for
the cases in which the conditional independence assumption was valid,
all six procedures provided estimates of the X-Y covariance that were
generally quite accurate. Their simulations also indicated that the
Mahalanobis distance measure produced less accurate matching than
subjectively weighted distance measures.

As we mentioned earlier, our own Monte-Carlo study was confined
to a trivariate normal model. However, our findings were sufficiently
interesting to justify their inclusion in this thesis. In
fact, some new facts about Kadane's bias-avoiding matching strategy
have already been mentioned in Section 3.1. Suppose, then, that
W = (X,Y,Z) is trivariate normal with zero means and variance-covariance
matrix

$$ \Sigma = \begin{pmatrix}
1 & \rho_{xy} & \rho_{xz} \\
\rho_{xy} & 1 & \rho_{yz} \\
\rho_{xz} & \rho_{yz} & 1
\end{pmatrix}. \tag{3.3.1} $$

Assume further that the following data are available for the purpose
of estimating the three unknown correlations in (3.3.1):

$$ \text{File 1:} \quad (X_i, Z_i), \quad i = 1, 2, \ldots, n \tag{3.3.2} $$

$$ \text{File 2:} \quad (Y_j, Z_j), \quad j = n+1, \ldots, 2n \tag{3.3.3} $$

In view of the discussions in Section 3.2, if the conditional
independence assumption $X \perp\!\!\!\perp Y \mid Z$ or, equivalently,

$$ \rho_{xy} = \rho_{xz}\, \rho_{yz} \tag{3.3.4} $$

were true, then we could avoid merging the files in (3.3.2) and (3.3.3)
because File 1 and File 2 can be used to get the sample correlations
$\hat{\rho}_{xz}$ and $\hat{\rho}_{yz}$, which in turn provide the maximum likelihood estimator
of $\rho_{xy}$, namely

$$ \hat{\rho}_{xy} = \hat{\rho}_{xz}\, \hat{\rho}_{yz}. \tag{3.3.5} $$

We shall say X and Y are conditionally dependent, given Z, iff
(3.3.4) does not hold; that is,

$$ \rho_{xy} \neq \rho_{xz}\, \rho_{yz}. $$

For the sake of simplicity, we shall consider hereinafter only the
conditional positive dependence case of the model in (3.3.1), namely

$$ \rho_{xy} > \rho_{xz}\, \rho_{yz}. \tag{3.3.6} $$

The complementary case of conditional negative dependence, namely

$$ \rho_{xy} < \rho_{xz}\, \rho_{yz}, $$

can, however, be handled by methods similar to ours. We shall also
include the case when $X \perp\!\!\!\perp Y \mid Z$ holds, mainly for comparing and
contrasting our results for the positive dependence case. Finally,
we shall evaluate matching strategies only from the point of view of
estimating $\rho_{xy}$, the correlation between variables which are not in
the same file, because File 1 and File 2 can respectively be used to
estimate the remaining parameters $\rho_{xz}$ and $\rho_{yz}$.

It is clear that, if the condition $X \perp\!\!\!\perp Y \mid Z$ does not hold, then
we should not estimate $\rho_{xy}$ by means of (3.3.5). In such a case,
matching the files (3.3.2) and (3.3.3) for estimation purposes is an
alternative that we shall study in this section. Thus, if after
merging, File 1 becomes the synthetic File 1, namely

$$ (X_i, Y_i^*, Z_i), \quad i = 1, 2, \ldots, n, \tag{3.3.7} $$

where $Y_i^*$ is the value of Y assigned to the ith record in the process
of merging, then we shall use the synthetic data $(X_i, Y_i^*)$,
$i = 1, 2, \ldots, n$, to estimate $\rho_{xy}$.

It was mentioned in Section 1.7 that performance characteristics,
which can help us assess the reliability of synthetic data
generated by independent files in (3.3.2) and (3.3.3), are not known.
Given this paucity, our program for an empirical evaluation of matching
strategies is as follows.

(i) Starting with a known correlation matrix given by (3.3.1),
generate data from the normal population of W = (X,Y,Z) and
create independent files (3.3.2) and (3.3.3). Note that data
on (X,Y), which is typically missing in actual matching
situations, is available in simulation studies.

(ii) Using any given matching strategy, merge the two files created
in Step (i) and compute the "synthetic correlation", denoted
by $\hat{\rho}_s$, which is defined to be the sample correlation coefficient
based on the $(X, Y^*)$ data given by the synthetic file
(3.3.7).

(iii) Compare $\hat{\rho}_s$ of Step (ii) with the following sample
correlations:

(a) $\hat{\rho}_{ml1}$, the sample correlation coefficient based on the
unbroken data $(X_i, Y_i)$, $i = 1, 2, \ldots, n$, which was generated
in Step (i). Observe that, if there is no a priori
restriction on the model parameters in (3.3.1), then $\hat{\rho}_{ml1}$
is the maximum likelihood estimator of $\rho_{xy}$.

(b) $\hat{\rho}_{ml2}$, the estimator of $\rho_{xy}$ given by (3.3.5), which is
also the maximum likelihood estimator of $\rho_{xy}$ when conditional
independence holds.

Because $\hat{\rho}_{ml1}$ and $\hat{\rho}_{ml2}$ are respectively based on one
sample on (X,Y) and two independent samples on (X,Z) and
(Y,Z), we shall also refer to these as one-sample and
two-sample estimates of $\rho_{xy}$.
Using the aforementioned program, we shall evaluate Kadane's
distance-based matching strategies discussed in Section 3.1, namely
the isotonic matching strategy and the procedure induced by the
Mahalanobis distance, and the method of matching in bins, which, as
explained in Subsection 3.1.2, is an adaptation of a strategy due to
Sims (1978). The synthetic correlations resulting from the use of
these three strategies will be denoted by $\hat{\rho}_{s1}$, $\hat{\rho}_{s2}$ and $\hat{\rho}_{s3}$
respectively.

Our study has been conducted for three values of n, namely 10,
25 and 50. The values of the population correlation $\rho_{xy}$ which
are used, among others, to generate random deviates from the normal
population of W = (X,Y,Z), were chosen from the following categories:

Low $\rho_{xy}$: 0.00, 0.25
Medium $\rho_{xy}$: 0.50, 0.60, 0.65, 0.70                (3.3.8)
High $\rho_{xy}$: 0.75 (0.05) 0.95, 0.99

Combined with low as well as high values of $\rho_{xz}$ and $\rho_{yz}$, there were
15 choices of $\rho_{xy}$ from (3.3.8) such that the conditional
independence restriction (3.3.4) was satisfied. As remarked earlier,
these correlations were chosen mainly to provide a basis such that
the estimates of $\rho_{xy}$ resulting from the case of conditional
positive dependence can be compared with those resulting from
conditional independence. The fifteen values of $\rho_{xy}$ in the
conditional independence case were increased in such a way that
positive dependence was achieved. Altogether, nineteen such $\Sigma$'s
were selected.

For n=10, W was generated 1000 times by using the IMSL
subroutines. The calculation of $\hat{\rho}_{s1}$ was based on sorting Z's in
the two files, as discussed in Section 3.1.1. Furthermore, $\hat{\rho}_{s2}$ was
computed for each realization by solving a linear assignment problem.
The Ford-Fulkerson algorithm (Zionts, 1974) was used for this
purpose. The computational cost for solving assignment problems grew
quite rapidly with n. Therefore, only 700 independent samples of
size n=25 were generated. A comprehensive examination of the results
for n=10,25 revealed that $\hat{\rho}_{s1}$ and $\hat{\rho}_{s2}$, the correlations corresponding
to Kadane's two distance measures, were, for all practical purposes,
identical (see Figures 3.1 and 3.2). In view of this and the high
computational costs, we compared only two strategies, the isotonic
and the method of matching in bins, for n=50 (2500 independent
samples).
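
One replication of this simulation program can be sketched in Python as
follows. The sketch is illustrative only: the function names are
hypothetical, Python's `random` module stands in for the IMSL
subroutines, and isotonic matching is implemented, as an assumption
based on the description "sorting Z's in the two files", by pairing the
records of the two files according to the rank of Z. The Mahalanobis
and binning strategies are not reproduced.

```python
import math
import random


def sample_corr(a, b):
    """Pearson sample correlation of two equal-length sequences."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    sab = sum((u - ma) * (v - mb) for u, v in zip(a, b))
    saa = sum((u - ma) ** 2 for u in a)
    sbb = sum((v - mb) ** 2 for v in b)
    return sab / math.sqrt(saa * sbb)


def draw_record(rng, rho_xy, rho_xz, rho_yz):
    """Draw (X, Y, Z) from a zero-mean, unit-variance trivariate normal."""
    e0, e1, e2 = (rng.gauss(0.0, 1.0) for _ in range(3))
    cx = math.sqrt(1.0 - rho_xz ** 2)
    a = (rho_xy - rho_xz * rho_yz) / cx      # zero under conditional independence
    b = math.sqrt(1.0 - rho_yz ** 2 - a ** 2)
    z = e0
    x = rho_xz * z + cx * e1
    y = rho_yz * z + a * e1 + b * e2
    return x, y, z


def one_replication(rng, n, rho_xy, rho_xz, rho_yz):
    """Return the one-sample, two-sample and isotonic synthetic estimates."""
    units1 = [draw_record(rng, rho_xy, rho_xz, rho_yz) for _ in range(n)]
    units2 = [draw_record(rng, rho_xy, rho_xz, rho_yz) for _ in range(n)]
    x1 = [u[0] for u in units1]
    y1 = [u[1] for u in units1]   # unobserved in practice, known in simulation
    z1 = [u[2] for u in units1]
    y2 = [u[1] for u in units2]
    z2 = [u[2] for u in units2]
    # Isotonic match: k-th smallest Z in File 1 receives the Y of the
    # record with the k-th smallest Z in File 2.
    order1 = sorted(range(n), key=lambda i: z1[i])
    order2 = sorted(range(n), key=lambda j: z2[j])
    y_star = [0.0] * n
    for i, j in zip(order1, order2):
        y_star[i] = y2[j]
    return {
        "ml1": sample_corr(x1, y1),
        "ml2": sample_corr(x1, z1) * sample_corr(y2, z2),
        "s1": sample_corr(x1, y_star),
    }


rng = random.Random(1)
reps = [one_replication(rng, 10, 0.60, 0.92, 0.65) for _ in range(1000)]
means = {k: sum(r[k] for r in reps) / len(reps) for k in ("ml1", "ml2", "s1")}
print({k: round(v, 4) for k, v in means.items()})
```

The example uses n=10 and 1000 replications with one parameter
combination from the conditional independence case; other combinations
can be passed in the same way, provided the implied correlation matrix
is positive definite.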

Four summary statistics, namely the mean, the standard
deviation, the minimum and the maximum for the simulated data on
$\hat{\rho}_{ml1}$, $\hat{\rho}_{ml2}$, $\hat{\rho}_{s1}$, $\hat{\rho}_{s2}$, $\hat{\rho}_{s3}$ were calculated for 34 $\Sigma$'s selected
for the study. However, we provide these statistics only for a
representative collection of 15 $\Sigma$'s in Tables 3.1 to 3.7. For
each $\Sigma$ and for any $\hat{\rho}$, the first entry in the tables is the mean,
the second entry (in parentheses) is the standard deviation and the
third and the fourth entries are respectively the minimum and the
maximum. Also, the General Plotting Package at The Ohio State
University was used to plot the following pairs of estimates of $\rho_{xy}$
(Figures 3.1 to 3.20):

(i) $\hat{\rho}_{s1}$ vs. $\hat{\rho}_{s2}$
(ii) $\hat{\rho}_{s1}$ vs. $\hat{\rho}_{s3}$
(iii) $\hat{\rho}_{s1}$ vs. $\hat{\rho}_{ml1}$
(iv) $\hat{\rho}_{s1}$ vs. $\hat{\rho}_{ml2}$
(v) $\hat{\rho}_{s2}$ vs. $\hat{\rho}_{ml1}$
(vi) $\hat{\rho}_{s2}$ vs. $\hat{\rho}_{ml2}$
(vii) $\hat{\rho}_{s3}$ vs. $\hat{\rho}_{ml1}$
(viii) $\hat{\rho}_{s3}$ vs. $\hat{\rho}_{ml2}$

3.3.1  Conclusions  of  the  Monte  Carlo  Study 

Tables 3.1 to 3.4 clearly show that the two estimates $\hat{\rho}_{s1}$ and
$\hat{\rho}_{s2}$, provided by the isotonic matching strategy and the
Mahalanobis-distance based strategy, respectively, have nearly identical
summary statistics. In fact, an examination of all the results showed that,
for all values of n and $\Sigma$ in our study, the estimates $\hat{\rho}_{s1}$ and $\hat{\rho}_{s2}$
were the same for most of the realizations of W. Figures 3.1 and 3.2
provide the empirical evidence of this fact.

Now we shall discuss our results in the case of conditional
independence. As noted in Section 3.2, $\hat{\rho}_{ml2}$ is the maximum likelihood
estimator of $\rho_{xy}$ under this model, whereas $\hat{\rho}_{ml1}$, the method of
moments estimator based on paired data, is computed for comparison
purposes. As expected, $\hat{\rho}_{ml1}$ and $\hat{\rho}_{ml2}$ behave equally well on the
average even though the estimated standard error of $\hat{\rho}_{ml1}$ is
consistently higher than that of $\hat{\rho}_{ml2}$; furthermore, the ranges of $\hat{\rho}_{ml1}$
are consistently larger than those of $\hat{\rho}_{ml2}$ (see Tables 3.1, 3.3 and
3.5).

For low correlation and each n, $\hat{\rho}_{s1}$, $\hat{\rho}_{s2}$ and $\hat{\rho}_{s3}$ compare well
with the estimates $\hat{\rho}_{ml1}$ or $\hat{\rho}_{ml2}$ as far as the averages are concerned
(see Tables 3.1, 3.3 and 3.5). However, the synthetic data estimators
have larger variation than $\hat{\rho}_{ml2}$, as shown in Fig. 3.3 - Fig. 3.5.
Furthermore, all the synthetic data estimators have variation
comparable to that of $\hat{\rho}_{ml1}$, as shown in Fig. 3.6 - Fig. 3.8.


For medium and high values of $\rho_{xy}$, all three synthetic estimators
exhibit some amount of negative bias with regard to both $\hat{\rho}_{ml1}$
and $\hat{\rho}_{ml2}$. Also, $\hat{\rho}_{s3}$, the estimator given by the method of matching
in bins, is more negatively biased than $\hat{\rho}_{s1}$ and $\hat{\rho}_{s2}$. Tables 3.1, 3.3
and 3.5, Fig. 3.9 - Fig. 3.14 illustrate these points. Again, $\hat{\rho}_{s3}$ is
worse than $\hat{\rho}_{s1}$ and $\hat{\rho}_{s2}$. These patterns among the five estimates
exist for any sample size even though the difference between
synthetic data estimators and $\hat{\rho}_{ml2}$ tends to decrease as n increases.

Turning to the conditional positive dependence case, we first
note that $\hat{\rho}_{ml1}$ is a reasonable estimator of $\rho_{xy}$, even though it would
not be available to the practitioner. On comparing $\hat{\rho}_{ml1}$ with the
synthetic data estimators $\hat{\rho}_{s1}$, $\hat{\rho}_{s2}$ and $\hat{\rho}_{s3}$, and with $\hat{\rho}_{ml2}$, we find
that these estimators perform very badly, in that all of them are
consistently underestimates and therefore heavily negatively biased
(see Tables 3.2, 3.4, 3.6 and 3.7 and Fig. 3.15).


For each n, and low or medium choices of $\rho_{xy}$, the synthetic data
estimators are comparable to $\hat{\rho}_{ml2}$, whereas for high values of $\rho_{xy}$,
the three synthetic data estimators have a definite negative bias
compared with $\hat{\rho}_{ml2}$. Tables 3.2, 3.4, 3.6 and 3.7 and Fig. 3.16 -
Fig. 3.19 support this conclusion. Furthermore, it is observed that
$\hat{\rho}_{s3}$ based on binning is worse than $\hat{\rho}_{s1}$ ($\hat{\rho}_{s2}$), as illustrated by
Fig. 3.20. However, the difference between the average $\hat{\rho}_{ml2}$ and
$\hat{\rho}_{si}$, i = 1,2,3, tends to decrease as n increases.

Finally, it must be pointed out that as the positive dependence
increases, i.e., as $\rho_{xy} - \rho_{xz}\rho_{yz}$ increases, the bias in the three
synthetic data estimators and $\hat{\rho}_{ml2}$ increases. Tables 3.4 and 3.7
illustrate this fact.


Based on these observations, we must conclude that when the
conditional independence model holds, the synthetic data estimators
do not provide any advantage over $\hat{\rho}_{ml2}$, the no-matching estimator.
In fact, they are slightly worse than $\hat{\rho}_{ml2}$. On the other hand,
in the case of conditional positive dependence, $\hat{\rho}_{ml2}$ and all the
synthetic data estimators perform badly, the performance of
synthetic data estimators being slightly worse than that of $\hat{\rho}_{ml2}$.
Thus estimators based on matching strategies do not seem to provide
any advantage over the estimators based on the assumption of
conditional independence and no matching. Thus, for estimating $\rho_{xy}$
in Case III models, the extra work involved in matching data files
is almost worthless. Further studies are in order for much larger
sample sizes to examine if this picture changes at all. We should
point  out  that  it  is  possible  that  matching  may  be  useful  for 
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Table  3.1  Summary  Statistics  of  Sample 
Correlations  -  Files  with  n«10  Records 
Conditional  Independence  Case 


1 


B 


1 

I 

I 


I 


K 

I 


px* 

pn 

0 

xy 

pmll 

pmt2 

pal 

Ps2 

pa3 

0.0149 

-0.0032 

-0.0101 

0.0100 

0.0114 

(0.3384) 

(0.1127) 

(0.3296) 

(0.3297) 

(0.3212) 

0.00 

0.10 

0.00 

-0,8170 

-0.5844 

-0.7575 

•0.7575 

0.8306 

0.8472 

0.4675 

0.8590 

0.8590 

0.7708 

0 . 879 

0.5794 

0.5457 

0.5457 

0.3103 

(0.2212) 

(0.2006) 

(0.2337) 

(0.2337) 

(0.2396) 

0.92 

0.65 

0.60 

-0.6523 

-0.4040 

-0.6058 

•0.6038 

-0.6038 

0.9753 

0.9431 

0.9626 

0.9626 

0.9681 

0.6830 

0.6638 

0.6150 

0.6131 

0.3748 

(0.1986) 

(0.1728) 

(0.2087) 

(0.2086) 

(0.2230) 

0  93 

0.75 

0.70 

-0.3369 

0.1437 

0.3115 

0.3113 

0.3396 

0.9936 

0.9609 

0.9576 

0.9376 

0.9696 

1 

R 

R 
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Table  3.1  (Cont'd. > 


pxz 

V 

pxy 

pmU 

pmi,2 

p8l 

p82 

PS3 

0.7863 

0.7775 

0.7302 

0.7302 

0.6874 

(0.1445) 

(0.1182) 

(0.1522) 

(0.1522) 

(0.1731) 

0.94 

0.B5 

0.80 

-0.3432 

0.2058 

-0.2367 

-0.2367 

-0.2367 

0.9879 

0.9566 

0.9799 

0.9799 

0.9723 

0.8937 

0.8901 

0.8252 

0.8251 

0.7789 

(0.0764) 

(0.0625) 

(0.0994) 

(0.0995) 

(0.1236) 

0.95 

0.95 

0.90 

0.3247 

0.3508 

0.3821 

0.3821 

0.1796 

0.9949 

0.9814 

0.9850 

0.9850 

0.9725 

0.9448 

0.9421 

0.8758 

0.8760 

0.8238 

<0.0419)  (0.0317)  (0.0741)  (0.0741)  (0.1063) 

0.5329  0.7364  0.5027  0.5027  0.2123 

0.9973  0.9910  0.9898  0.9898  0.9868 


0.97  0.97  0.95 

Table 3.2  Summary Statistics of Sample
Correlations - Files with n=10 Records
Conditional Positive Dependence Case

ρxz   ρyz   ρxy      ρml1      ρml2      ρs1       ρs2       ρs3

0.00  0.10  0.95     0.9413   -0.0046   -0.0289    0.0395   -0.0153
                    (0.0474)  (0.1142)  (0.3310)  (0.3327)  (0.3269)
                     0.5942   -0.5723   -0.8425   -0.8525   -0.8962
                     0.9959    0.5302    0.8897    0.8897    0.8181

0.92  0.65  0.88     0.8676    0.5729    0.5276    0.5108    0.4919
                    (0.0885)  (0.2021)  (0.2403)  (0.2443)  (0.2483)
                     0.2744   -0.5510   -0.6166   -0.6248   -0.6119
                     0.9914    0.9407    0.9621    0.9621    0.9621

0.93  0.75  0.92     0.9103    0.6771    0.6310    0.6262    0.5834
                    (0.0666)  (0.1617)  (0.2018)  (0.2050)  (0.2085)
                     0.4811   -0.2063   -0.3529   -0.3529   -0.2667
                     0.9918    0.9448    0.9722    0.9722    0.9892

0.94  0.85  0.96     0.9558    0.7741    0.7188    0.7165    0.6687
                    (0.0353)  (0.1153)  (0.1573)  (0.1578)  (0.1781)
                     0.6288    0.2202   -0.2325   -0.2325   -0.1806
                     0.9960    0.9798    0.9707    0.9707    0.9535

0.95  0.95  0.98     0.9775    0.8871    0.8225    0.8211    0.7770
                    (0.0177)  (0.0640)  (0.1036)  (0.1040)  (0.1231)
                     0.8491    0.4165    0.2546    0.2546    0.0215
                     0.9986    0.9783    0.9922    0.9922    0.9727

0.97  0.97  0.99     0.9888    0.9439    0.8770    0.8774    0.8258
                    (0.0088)  (0.0329)  (0.0760)  (0.0755)  (0.1039)
                     0.9184    0.6081    0.4432    0.4432    0.3541
                     0.9992    0.9919    0.9894    0.9894    0.9857
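
The "Conditional Independence" and "Conditional Positive Dependence" designs can be told apart by the partial correlation of X and Y given Z, which for a trivariate normal distribution is (ρxy - ρxz ρyz) / sqrt((1 - ρxz²)(1 - ρyz²)). The sketch below evaluates this standard identity for the parameter triples printed in the n=10 tables. The function name and the two lists are introduced only for illustration.

```python
import math


def partial_corr_xy_given_z(rho_xz, rho_yz, rho_xy):
    """Partial correlation of X and Y given Z for a trivariate normal."""
    numerator = rho_xy - rho_xz * rho_yz
    denominator = math.sqrt((1.0 - rho_xz ** 2) * (1.0 - rho_yz ** 2))
    return numerator / denominator


# (rho_xz, rho_yz, rho_xy) as printed in the conditional independence tables.
INDEPENDENCE_DESIGNS = [
    (0.00, 0.10, 0.00), (0.92, 0.65, 0.60), (0.93, 0.75, 0.70),
    (0.94, 0.85, 0.80), (0.95, 0.95, 0.90), (0.97, 0.97, 0.95),
]

# (rho_xz, rho_yz, rho_xy) as printed in the positive dependence tables.
DEPENDENCE_DESIGNS = [
    (0.00, 0.10, 0.95), (0.92, 0.65, 0.88), (0.93, 0.75, 0.92),
    (0.94, 0.85, 0.96), (0.95, 0.95, 0.98), (0.97, 0.97, 0.99),
]

if __name__ == "__main__":
    for label, designs in (("independence", INDEPENDENCE_DESIGNS),
                           ("dependence", DEPENDENCE_DESIGNS)):
        for rho_xz, rho_yz, rho_xy in designs:
            value = partial_corr_xy_given_z(rho_xz, rho_yz, rho_xy)
            print(label, rho_xz, rho_yz, rho_xy, round(value, 3))
```

In the independence designs the printed ρxy agrees with the product ρxz ρyz to within about 0.01, so the partial correlation is close to zero only up to the rounding of the printed parameters. In the dependence designs ρxy exceeds the product and the partial correlation is clearly positive.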

Table 3.3  Summary Statistics of Sample
Correlations - Files with n=25 Records
Conditional Independence Case

ρxz   ρyz   ρxy      ρml1      ρml2      ρs1       ρs2       ρs3

0.00  0.10  0.00    -0.0068    0.0001   -0.0025   -0.0026   -0.0040
                    (0.2059)  (0.0479)  (0.2013)  (0.2014)  (0.2008)
                    -0.6576   -0.2851   -0.5749   -0.5749   -0.6980
                     0.5450    0.2501    0.6196    0.6196    0.5087

0.92  0.65  0.60     0.5915    0.5788    0.5568    0.5564    0.5171
                    (0.1336)  (0.1231)  (0.1365)  (0.1365)  (0.1476)
                    -0.0576   -0.0890    0.0259    0.0259   -0.0468
                     0.8704    0.8189    0.8663    0.8663    0.8096

0.93  0.75  0.70     0.6859    0.6859    0.6620    0.6627    0.6111
                    (0.1087)  (0.0935)  (0.1096)  (0.1097)  (0.1216)
                     0.2953    0.2697    0.1828    0.1828    0.1642
                     0.9022    0.8959    0.8955    0.8955    0.8973

0.94  0.85  0.80     0.7993    0.7934    0.7644    0.7643    0.7129
                    (0.0754)  (0.0617)  (0.0789)  (0.0790)  (0.0964)
                     0.4274    0.4778    0.4617    0.4617    0.2724
                     0.9380    0.9087    0.9139    0.9139    0.9241

0.95  0.95  0.90     0.8967    0.8961    0.8648    0.8643    0.8049
                    (0.0416)  (0.0313)  (0.0473)  (0.0476)  (0.0676)
                     0.7057    0.7592    0.6580    0.6580    0.4614
                     0.9753    0.9636    0.9632    0.9632    0.9297

0.97  0.97  0.95     0.9479    0.9473    0.9117    0.9123    0.8485
                    (0.0211)  (0.0154)  (0.0327)  (0.0326)  (0.0605)
                     0.8446    0.8638    0.7636    0.7636    0.5102
                     0.9874    0.9755    0.9735    0.9735    0.9519

Table 3.4  Summary Statistics of Sample
Correlations - Files with n=25 Records
Conditional Positive Dependence Case

ρxz   ρyz   ρxy      ρml1      ρml2      ρs1       ρs2       ρs3

0.00  0.10  0.95     0.9475   -0.0019    0.0058   -0.0372   -0.0004
                    (0.0222)  (0.0439)  (0.2061)  (0.2038)  (0.1989)
                     0.8249   -0.281?   -0.5665   -0.5480   -0.7596
                     0.9857    0.1963    0.6964    0.6964    0.5557

0.92  0.65  0.88     0.8758    0.5857    0.5643    0.5149    0.5277
                    (0.0503)  (0.1207)  (0.1331)  (0.1436)  (0.1425)
                     0.6051    0.1442    0.1621    0.0617    0.0404
                     0.9738    0.8344    0.8896    0.8896    0.8512

0.93  0.75  0.92     0.9143    0.6907    0.6627    0.6489    0.6190
                    (0.0361)  (0.0851)  (0.1058)  (0.1093)  (0.1125)
                     0.6844    0.2967    0.2949    0.2641    0.1829
                     0.9774    0.8876    0.8661    0.8642    0.9020

0.94  0.85  0.96     0.9578    0.7931    0.7641    0.7539    0.7127
                    (0.0174)  (0.0624)  (0.0832)  (0.0853)  (0.0948)
                     0.8756    0.5449    0.3612    0.3647    0.3425
                     0.9893    0.9226    0.9181    0.9174    0.9128

0.95  0.95  0.98     0.9792    0.8956    0.8614    0.8543    0.7998
                    (0.0096)  (0.0308)  (0.0496)  (0.0516)  (0.0691)
                     0.9131    0.7693    0.6315    0.6226    0.5157
                     0.9959    0.9661    0.9647    0.9647    0.9413

0.97  0.97  0.99     0.9895    0.9475    0.9123    0.9139    0.8499
                    (0.0042)  (0.0158)  (0.0339)  (0.0336)  (0.0584)
                     0.9685    0.8769    0.7182    0.7352    0.5685
                     0.9972    0.9833    0.9769    0.9849    0.9773

Table 3.5  Summary Statistics of Sample
Correlations - Files with n=50 Records
Conditional Independence Case

ρxz   ρyz   ρxy      ρml1      ρml2      ρs1       ρs3

0.00  0.10  0.00    -0.0004   -0.0003    0.0019    0.0044
                    (0.1436)  (0.0242)  (0.1474)  (0.1445)
                    -0.4381   -0.1663   -0.4872   -0.5205
                     0.4746    0.1244    0.4398    0.4574

0.92  0.65  0.60     0.5936    0.5952    0.5823    0.5391
                    (0.0916)  (0.0704)  (0.0909)  (0.0959)
                     0.2530    0.2219    0.2242    0.1098
                     0.8377    0.8103    0.7998    0.7873

0.93  0.75  0.70     0.6950    0.6953    0.6807    0.6279
                    (0.0756)  (0.0612)  (0.0709)  (0.0815)
                     0.2796    0.3696    0.3760    0.2526
                     0.8768    0.8426    0.87?8    0.8543

0.94  0.85  0.80     0.7959    0.7974    0.7797    0.7198
                    (0.0528)  (0.0408)  (0.0327)  (0.0645)
                     0.5689    0.5664    0.4919    0.4531
                     0.9204    0.9082    0.9222    0.8821

0.95  0.95  0.90     0.8982    0.8978    0.8778    0.8110
                    (0.0289)  (0.0200)  (0.0306)  (0.0493)
                     0.7132    0.7845    0.7331    0.6079
                     0.9634    0.9467    0.9393    0.9149

0.97  0.97  0.95     0.9486    0.9490    0.9276    0.8559
                    (0.0151)  (0.0103)  (0.0199)  (0.0419)
                     0.8549    0.9100    0.8039    0.6329
                     0.9808    0.9743    0.9761    0.9576

Table 3.6  Summary Statistics of Sample
Correlations - Files with n=50 Records
Conditional Positive Dependence Case

ρxz   ρyz   ρxy      ρml1      ρml2      ρs1       ρs3

0.00  0.10  0.95     0.9991    0.0001    0.0013    0.0023
                    (0.0148)  (0.0243)  (0.1473)  (0.1427)
                     0.8700   -0.1447   -0.3236   -0.3137
                     0.9828    0.1306    0.4727    0.3143

0.92  0.65  0.88     0.8776    0.3934    0.3809    0.3358
                    (0.0336)  (0.0817)  (0.0928)  (0.0981)
                     0.6908    0.2791    0.1319    0.1593
                     0.9576    0.8031    0.8181    0.8338

0.93  0.75  0.92     0.9183    0.6944    0.6771    0.6257
                    (0.0223)  (0.0638)  (0.0752)  (0.0834)
                     0.8119    0.4028    0.3506    0.2950
                     0.9698    0.8628    0.8599    0.8595

0.94  0.85  0.96     0.9595    0.7967    0.7803    0.7198
                    (0.0116)  (0.0415)  (0.0512)  (0.0627)
                     0.8793    0.6023    0.5699    0.3595
                     0.9853    0.8960    0.9158    0.8824

0.95  0.95  0.98     0.9794    0.8973    0.8776    0.8106
                    (0.0061)  (0.0200)  (0.0294)  (0.0468)
                     0.9390    0.8096    0.7596    0.6273
                     0.9932    0.9506    0.9570    0.9279

0.97  0.97  0.99     0.9898    0.9492    0.9281    0.8555
                    (0.0029)  (0.0107)  (0.0200)  (0.0426)
                     0.9736    0.8927    0.8181    0.6501
                     0.9964    0.9757    0.9713    0.9555

Table 3.7  Summary Statistics of Sample
Correlations - Files with n=25 Records
Conditional Positive Dependence Case

ρxz   ρyz   ρxy      ρml1      ρml2      ρs1       ρs2       ρs3

0.00  0.10  0.50     0.4933    0.0008   -0.0027   -0.0063    0.0012
                    (0.1574)  (0.0451)  (0.2117)  (0.2105)  (0.2044)
                    -0.0632   -0.1632   -0.6421   -0.6421   -0.0035
                     0.8777    0.1976    0.6186    0.6186    0.5807

0.92  0.69  0.79     0.7425    0.5876    0.5655    0.5622    0.5236
                    (0.0940)  (0.1108)  (0.1292)  (0.1301)  (0.1430)
                     0.2986    0.1141   -0.0065   -0.0065    0.0205
                     0.9390    0.8326    0.8621    0.8621    0.8285

0.93  0.75  0.80     0.7943    0.6919    0.6683    0.6691    0.6249
                    (0.0762)  (0.0889)  (0.1109)  (0.1102)  (0.1180)
                     0.3982    0.3129    0.1844    0.1844    0.2023
                     0.9373    0.8978    0.9047    0.9047    0.8853

Figure 3.9  Isotonic vs. Nomatching.
ρxz = 0.93, ρyz = 0.75, ρxy = 0.70, n = 25.

Figure 3.10  Mahalanobis vs. Nomatching.

Figure 3.16  Isotonic vs. Nomatching.

Figure 3.17  Matching in Bins vs. Nomatching.
ρxz = 0.93, ρyz = 0.75, ρxy = 0.80, n = 25.